maisi: NaN occurring during the diffusion model training #1926

Open
@shengzhang90

Description

Hi,

When I train the diffusion model with the trained VAE autoencoder weights, I encounter a NaN loss. The following is part of the log:

lr: [0.0001]
lr: [0.0001]
Epoch 201 train_vae_loss 0.039707845827617515: {'recons_loss': 0.015235490621573968, 'kl_loss': 85897.05040993346, 'p_loss': 0.05294216721683401}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 202 train_vae_loss 0.036837538356057416: {'recons_loss': 0.013532256549082612, 'kl_loss': 84689.69741415161, 'p_loss': 0.049454373551865494}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 203 train_vae_loss 0.04066830881296869: {'recons_loss': 0.01579887273158354, 'kl_loss': 86930.33920508555, 'p_loss': 0.053921340536255344}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 204 train_vae_loss 0.04466726511873636: {'recons_loss': 0.017411270744553255, 'kl_loss': 86347.42582580798, 'p_loss': 0.06207083930534102}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 205 train_vae_loss 0.039213172850076874: {'recons_loss': 0.014744859089642877, 'kl_loss': 85096.7134475998, 'p_loss': 0.05319547471891338}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 206 train_vae_loss 0.038383807665236705: {'recons_loss': 0.014200327562047841, 'kl_loss': 85760.27710610742, 'p_loss': 0.05202484130859375}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 207 train_vae_loss 0.03932591601622308: {'recons_loss': 0.014606543448346422, 'kl_loss': 87033.87854978612, 'p_loss': 0.05338661570966017}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 208 train_vae_loss 0.058224454522186636: {'recons_loss': 0.022603081671181118, 'kl_loss': 143004.20642229088, 'p_loss': 0.07106984069592145}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 209 train_vae_loss 0.04279889678179953: {'recons_loss': 0.016451959535408723, 'kl_loss': 91421.0720725404, 'p_loss': 0.057349433463789214}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 210 train_vae_loss 0.04419510093541189: {'recons_loss': 0.017602585414566184, 'kl_loss': 88753.12170270912, 'p_loss': 0.05905734450191599}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 210 val_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
lr: [0.0001]
Epoch 211 train_vae_loss 0.047102449947657665: {'recons_loss': 0.018618881483757056, 'kl_loss': 99892.8082075808, 'p_loss': 0.061647625477141754}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 212 train_vae_loss 0.04109727745322826: {'recons_loss': 0.015783639634453017, 'kl_loss': 88632.56267823195, 'p_loss': 0.054834605169840185}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 213 train_vae_loss 0.04030142836448789: {'recons_loss': 0.015340764416353387, 'kl_loss': 87118.63764704135, 'p_loss': 0.054162667278101234}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 214 train_vae_loss 0.040564939826983476: {'recons_loss': 0.015572800811826333, 'kl_loss': 89774.16198312737, 'p_loss': 0.05338240938948134}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 215 train_vae_loss 0.04143597853471239: {'recons_loss': 0.01541382184043215, 'kl_loss': 96709.1665948788, 'p_loss': 0.054504133449307865}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 216 train_vae_loss 0.04318358794278379: {'recons_loss': 0.015925193886261985, 'kl_loss': 91295.16699590067, 'p_loss': 0.06042959118977246}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 217 train_vae_loss 0.05705382756280092: {'recons_loss': 0.020152890289860986, 'kl_loss': 109472.0783923479, 'p_loss': 0.08651243144568381}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
lr: [0.0001]
Epoch 218 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 219 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
lr: [0.0001]
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
Epoch 220 train_vae_loss nan: {'recons_loss': nan, 'kl_loss': nan, 'p_loss': nan}.
Save trained autoencoder to ./models/vae/pretrain/20241128190854/autoencoder.pt
Save trained discriminator to ./models/vae/pretrain/20241128190854/discriminator.pt
lr: [0.0001]
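
Note that in the log above, the "Save trained autoencoder" and "Save trained discriminator" lines keep appearing after the training loss becomes NaN at epoch 218, so the checkpoint files may be overwritten from a run that has already diverged. As a minimal sketch only, a finite-loss guard could look like the following. The helper name `losses_are_finite` and the loop structure are hypothetical and not taken from the MAISI training script; the loss values are copied from epochs 217 and 218 of the log.

```python
import math


def losses_are_finite(losses):
    """Return True only if every loss value is finite (no NaN, no inf).

    Hypothetical helper for illustration; not part of the MAISI scripts.
    """
    return all(math.isfinite(float(value)) for value in losses.values())


# Loss values copied from the log above (epochs 217 and 218).
epoch_217 = {
    "recons_loss": 0.020152890289860986,
    "kl_loss": 109472.0783923479,
    "p_loss": 0.08651243144568381,
}
epoch_218 = {
    "recons_loss": float("nan"),
    "kl_loss": float("nan"),
    "p_loss": float("nan"),
}

for epoch, losses in [(217, epoch_217), (218, epoch_218)]:
    if losses_are_finite(losses):
        # A real training loop would run its checkpoint-saving code here.
        print(f"Epoch {epoch}: losses are finite, checkpoint can be saved")
    else:
        # Skipping the save keeps the last good checkpoint on disk.
        print(f"Epoch {epoch}: non-finite loss, skipping checkpoint save")
```

A check like this does not explain why the loss diverges; it only keeps the last checkpoint written before the NaN from being replaced.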

Thanks a lot.
