Problem Description
During training, critical instability issues are observed:
Q_discriminator drops rapidly from -40 to beyond -200 and finally settles around -300;
Total actor loss explodes to 7e4, far beyond any reasonable range;
The discriminator maintains an extremely low loss and achieves over 96% accuracy in distinguishing expert samples from policy-generated samples, producing lopsided training in which the discriminator dominates the policy and prevents it from learning expert-like behaviors.
Core Questions
Why does Q_discriminator keep dropping rapidly and cause the actor loss to explode, even with an extremely low discriminator learning rate?
Is it necessary to further constrain the discriminator's capacity, or to adjust the actor loss weighting logic (e.g., set cfg.train.scale_reg to False)?
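One common mitigation for the failure mode described above (a confident discriminator driving the imitation reward, and hence Q_discriminator, to unbounded magnitudes) is to clamp the discriminator logit before converting it to a reward. The sketch below is not from the original setup; the function name, the clip threshold, and the reward convention r = -log(1 - D(s, a)) are all assumptions for illustration:

```python
import math

def discriminator_reward(logit: float, clip: float = 10.0) -> float:
    """Hypothetical helper: bounded GAIL-style reward from a discriminator logit.

    With D = sigmoid(logit), the reward r = -log(1 - D) equals softplus(logit),
    which grows without bound as the discriminator becomes confident. Clamping
    the logit keeps the reward (and thus the Q-values it drives) bounded.
    """
    # Assumed clip threshold; tune per task.
    logit = max(-clip, min(clip, logit))
    # Numerically stable softplus: log(1 + exp(x)) = max(x, 0) + log1p(exp(-|x|))
    return max(logit, 0.0) + math.log1p(math.exp(-abs(logit)))
```

Even if the discriminator outputs an extreme logit (e.g., one that would otherwise yield a reward magnitude in the hundreds), the clipped reward stays within roughly [softplus(-clip), softplus(clip)], so Q_discriminator cannot run away the same way; this is orthogonal to also weakening the discriminator itself (lower update frequency, gradient penalty, or label smoothing).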