
Model Not Converging (Issue with ControlNet Fine-Tuning Script) #1897

@MaybeRichard

Description

Hello, I am encountering an issue with the ControlNet fine-tuning script provided in this repository. When fine-tuning the model on both the KiTS dataset and a custom dataset, the model does not converge.

To Reproduce
Steps to reproduce the behavior:

  1. Go to 'generation/maisi'
  2. Run commands 'python -m scripts.train_controlnet -c ./configs/config_maisi.json -t ./configs/config_maisi_controlnet_train.json -e ./configs/environment_maisi_controlnet_train.json -g 1'

Problem Background:

  • Datasets: I am using the KiTS dataset as well as a custom medical image dataset.
  • Fine-tuning Script: I followed the steps outlined in the documentation for using the fine-tuning script.
  • Training Configuration: I used the default configuration and also tried different learning rates, batch sizes, and weighted_loss values.
  • Results: During training, the loss does not decrease meaningfully over many epochs (5k-8k epochs), and the model outputs show little to no variation.
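For context on the weighted_loss setting: the training log reports "apply weighted loss = 100 on labels: [129]", which suggests the loss is up-weighted inside voxels carrying a particular segmentation label. Below is a minimal numpy sketch of how such a region-weighted MSE is typically implemented; the function name `weighted_mse` and its parameters are illustrative, not the actual API of the MAISI training script.

```python
import numpy as np

def weighted_mse(pred, target, labels, roi_label=129, roi_weight=100.0):
    """MSE with an up-weighted region of interest.

    roi_label/roi_weight mirror the "weighted loss = 100 on labels: [129]"
    line in the log; the real MAISI implementation may differ.
    """
    # Start with uniform weight 1, then boost voxels whose label matches.
    weight = np.ones_like(pred, dtype=float)
    weight[labels == roi_label] = roi_weight
    return float(np.mean(weight * (pred - target) ** 2))
```

With weights this large, a handful of ROI voxels can dominate the batch loss, which is one plausible source of the noisy, non-decreasing loss curve.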

Screenshots
[screenshot: Snipaste_2024-12-14_21-38-30]

Output
[2024-12-14 17:52:38.415] INFO - load trained controlnet model from ./models/controlnet-20datasets-e20wl100fold0bc_noi_dia_fsize_current.pt
[2024-12-14 17:52:38.429] INFO - total number of training steps: 600.0.
[2024-12-14 17:52:38.430] INFO - apply weighted loss = 100 on labels: [129]
[Epoch 1/100] [Batch 1/6] [LR: 0.00000997] [loss: 0.0594] ETA: 0:00:57.509962
[Epoch 1/100] [Batch 2/6] [LR: 0.00000993] [loss: 0.0397] ETA: 0:00:05.898367
[Epoch 1/100] [Batch 3/6] [LR: 0.00000990] [loss: 0.0117] ETA: 0:00:04.425231
[Epoch 1/100] [Batch 4/6] [LR: 0.00000987] [loss: 0.0154] ETA: 0:00:02.951793
[Epoch 1/100] [Batch 5/6] [LR: 0.00000983] [loss: 0.0096] ETA: 0:00:01.475626
[Epoch 1/100] [Batch 6/6] [LR: 0.00000980] [loss: 0.0155] ETA: 0:00:00
[2024-12-14 17:52:58.030] INFO - best loss -> 0.02520664595067501.
[Epoch 2/100] [Batch 1/6] [LR: 0.00000977] [loss: 0.1253] ETA: 0:00:31.438770
[Epoch 2/100] [Batch 2/6] [LR: 0.00000974] [loss: 0.5812] ETA: 0:00:05.911908
[Epoch 2/100] [Batch 3/6] [LR: 0.00000970] [loss: 0.7666] ETA: 0:00:04.445841
[Epoch 2/100] [Batch 4/6] [LR: 0.00000967] [loss: 0.2461] ETA: 0:00:02.972570
[Epoch 2/100] [Batch 5/6] [LR: 0.00000964] [loss: 0.0693] ETA: 0:00:01.480636
[Epoch 2/100] [Batch 6/6] [LR: 0.00000960] [loss: 0.0123] ETA: 0:00:00
[Epoch 3/100] [Batch 1/6] [LR: 0.00000957] [loss: 0.3098] ETA: 0:00:32.303094
[Epoch 3/100] [Batch 2/6] [LR: 0.00000954] [loss: 0.4730] ETA: 0:00:06.003017
[Epoch 3/100] [Batch 3/6] [LR: 0.00000951] [loss: 0.2315] ETA: 0:00:04.474122
[Epoch 3/100] [Batch 4/6] [LR: 0.00000947] [loss: 0.2388] ETA: 0:00:02.972777
[Epoch 3/100] [Batch 5/6] [LR: 0.00000944] [loss: 0.4927] ETA: 0:00:01.482379
[Epoch 3/100] [Batch 6/6] [LR: 0.00000941] [loss: 0.6689] ETA: 0:00:00
[Epoch 4/100] [Batch 1/6] [LR: 0.00000938] [loss: 0.0097] ETA: 0:00:32.200603
[Epoch 4/100] [Batch 2/6] [LR: 0.00000934] [loss: 0.2147] ETA: 0:00:05.911577
[Epoch 4/100] [Batch 3/6] [LR: 0.00000931] [loss: 0.1359] ETA: 0:00:04.428170
[Epoch 4/100] [Batch 4/6] [LR: 0.00000928] [loss: 0.1607] ETA: 0:00:02.957243
[Epoch 4/100] [Batch 5/6] [LR: 0.00000925] [loss: 0.2544] ETA: 0:00:01.484895
[Epoch 4/100] [Batch 6/6] [LR: 0.00000922] [loss: 0.1566] ETA: 0:00:00
[Epoch 5/100] [Batch 1/6] [LR: 0.00000918] [loss: 0.3608] ETA: 0:00:39.206282
[Epoch 5/100] [Batch 2/6] [LR: 0.00000915] [loss: 0.3654] ETA: 0:00:05.935111
[Epoch 5/100] [Batch 3/6] [LR: 0.00000912] [loss: 0.1226] ETA: 0:00:04.445168
[Epoch 5/100] [Batch 4/6] [LR: 0.00000909] [loss: 0.4570] ETA: 0:00:02.968678
[Epoch 5/100] [Batch 5/6] [LR: 0.00000906] [loss: 0.1441] ETA: 0:00:01.491980
[Epoch 5/100] [Batch 6/6] [LR: 0.00000903] [loss: 0.2887] ETA: 0:00:00
[Epoch 6/100] [Batch 1/6] [LR: 0.00000899] [loss: 0.3012] ETA: 0:00:33.762065
[Epoch 6/100] [Batch 2/6] [LR: 0.00000896] [loss: 0.7926] ETA: 0:00:06.078450
[Epoch 6/100] [Batch 3/6] [LR: 0.00000893] [loss: 0.0092] ETA: 0:00:04.465131
[Epoch 6/100] [Batch 4/6] [LR: 0.00000890] [loss: 0.2825] ETA: 0:00:02.993035
[Epoch 6/100] [Batch 5/6] [LR: 0.00000887] [loss: 0.1342] ETA: 0:00:01.493944
[Epoch 6/100] [Batch 6/6] [LR: 0.00000884] [loss: 0.0131] ETA: 0:00:00
[Epoch 7/100] [Batch 1/6] [LR: 0.00000880] [loss: 0.6823] ETA: 0:00:36.323832
[Epoch 7/100] [Batch 2/6] [LR: 0.00000877] [loss: 0.0610] ETA: 0:00:06.089784
[Epoch 7/100] [Batch 3/6] [LR: 0.00000874] [loss: 0.5895] ETA: 0:00:04.482495
[Epoch 7/100] [Batch 4/6] [LR: 0.00000871] [loss: 0.0163] ETA: 0:00:02.972429
[Epoch 7/100] [Batch 5/6] [LR: 0.00000868] [loss: 0.7321] ETA: 0:00:01.492317
[Epoch 7/100] [Batch 6/6] [LR: 0.00000865] [loss: 0.6678] ETA: 0:00:00
[Epoch 8/100] [Batch 1/6] [LR: 0.00000862] [loss: 0.1816] ETA: 0:00:30.358437
[Epoch 8/100] [Batch 2/6] [LR: 0.00000859] [loss: 0.3500] ETA: 0:00:05.981302
[Epoch 8/100] [Batch 3/6] [LR: 0.00000856] [loss: 0.1950] ETA: 0:00:04.491425
[Epoch 8/100] [Batch 4/6] [LR: 0.00000853] [loss: 0.0150] ETA: 0:00:02.980718
[Epoch 8/100] [Batch 5/6] [LR: 0.00000849] [loss: 0.6160] ETA: 0:00:01.494416
[Epoch 8/100] [Batch 6/6] [LR: 0.00000846] [loss: 0.7288] ETA: 0:00:00
[Epoch 9/100] [Batch 1/6] [LR: 0.00000843] [loss: 0.2774] ETA: 0:00:30.263555
[Epoch 9/100] [Batch 2/6] [LR: 0.00000840] [loss: 0.2248] ETA: 0:00:05.925602
[Epoch 9/100] [Batch 3/6] [LR: 0.00000837] [loss: 0.1062] ETA: 0:00:04.446892
[Epoch 9/100] [Batch 4/6] [LR: 0.00000834] [loss: 0.7962] ETA: 0:00:02.997850
[Epoch 9/100] [Batch 5/6] [LR: 0.00000831] [loss: 0.3089] ETA: 0:00:01.493284
[Epoch 9/100] [Batch 6/6] [LR: 0.00000828] [loss: 0.5279] ETA: 0:00:00
[Epoch 10/100] [Batch 1/6] [LR: 0.00000825] [loss: 0.1299] ETA: 0:00:35.395999
[Epoch 10/100] [Batch 2/6] [LR: 0.00000822] [loss: 0.5649] ETA: 0:00:05.970950
[Epoch 10/100] [Batch 3/6] [LR: 0.00000819] [loss: 0.3622] ETA: 0:00:04.472051
[Epoch 10/100] [Batch 4/6] [LR: 0.00000816] [loss: 0.5737] ETA: 0:00:02.999292
[Epoch 10/100] [Batch 5/6] [LR: 0.00000813] [loss: 0.1142] ETA: 0:00:01.490247
[Epoch 10/100] [Batch 6/6] [LR: 0.00000810] [loss: 0.0369] ETA: 0:00:00
[Epoch 11/100] [Batch 1/6] [LR: 0.00000807] [loss: 0.7214] ETA: 0:00:29.923871
[Epoch 11/100] [Batch 2/6] [LR: 0.00000804] [loss: 0.0399] ETA: 0:00:06.025214
[Epoch 11/100] [Batch 3/6] [LR: 0.00000801] [loss: 0.0103] ETA: 0:00:04.477165
[Epoch 11/100] [Batch 4/6] [LR: 0.00000798] [loss: 0.3072] ETA: 0:00:02.993378
[Epoch 11/100] [Batch 5/6] [LR: 0.00000795] [loss: 0.5458] ETA: 0:00:01.499980
[Epoch 11/100] [Batch 6/6] [LR: 0.00000792] [loss: 0.2169] ETA: 0:00:00
[Epoch 12/100] [Batch 1/6] [LR: 0.00000789] [loss: 0.0542] ETA: 0:00:35.199655
[Epoch 12/100] [Batch 2/6] [LR: 0.00000786] [loss: 0.1887] ETA: 0:00:06.034812
[Epoch 12/100] [Batch 3/6] [LR: 0.00000783] [loss: 0.0109] ETA: 0:00:04.507118
[Epoch 12/100] [Batch 4/6] [LR: 0.00000780] [loss: 0.2449] ETA: 0:00:03.011293
[Epoch 12/100] [Batch 5/6] [LR: 0.00000777] [loss: 0.7797] ETA: 0:00:01.508624
[Epoch 12/100] [Batch 6/6] [LR: 0.00000774] [loss: 0.1621] ETA: 0:00:00
[Epoch 73/100] [Batch 4/6] [LR: 0.00000075] [loss: 0.0447] ETA: 0:00:03.022383
[Epoch 73/100] [Batch 5/6] [LR: 0.00000074] [loss: 0.4118] ETA: 0:00:01.515217
[Epoch 73/100] [Batch 6/6] [LR: 0.00000073] [loss: 0.4299] ETA: 0:00:00
[Epoch 74/100] [Batch 1/6] [LR: 0.00000072] [loss: 0.0393] ETA: 0:00:31.598071
[Epoch 74/100] [Batch 2/6] [LR: 0.00000071] [loss: 0.0467] ETA: 0:00:06.056095
[Epoch 74/100] [Batch 3/6] [LR: 0.00000070] [loss: 0.7438] ETA: 0:00:04.549044
[Epoch 74/100] [Batch 4/6] [LR: 0.00000069] [loss: 0.7230] ETA: 0:00:03.040020
[Epoch 74/100] [Batch 5/6] [LR: 0.00000068] [loss: 0.0777] ETA: 0:00:01.519922
[Epoch 74/100] [Batch 6/6] [LR: 0.00000068] [loss: 0.0102] ETA: 0:00:00
[Epoch 75/100] [Batch 1/6] [LR: 0.00000067] [loss: 0.5960] ETA: 0:00:32.156808
[Epoch 75/100] [Batch 2/6] [LR: 0.00000066] [loss: 0.6143] ETA: 0:00:06.073716
[Epoch 75/100] [Batch 3/6] [LR: 0.00000065] [loss: 0.1679] ETA: 0:00:04.548900
[Epoch 75/100] [Batch 4/6] [LR: 0.00000064] [loss: 0.3282] ETA: 0:00:03.040982
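To make the flat trend easier to see than in the raw batch lines, here is a small sketch (assuming the log-line format shown above) that averages the loss per epoch; `epoch_means` is an illustrative helper, not part of the repository.

```python
import re
from collections import defaultdict

# Matches lines of the form shown in the log above, e.g.
# [Epoch 2/100] [Batch 3/6] [LR: 0.00000970] [loss: 0.7666] ETA: ...
PATTERN = re.compile(r"\[Epoch (\d+)/\d+\].*?\[loss: ([0-9.]+)\]")

def epoch_means(log_text):
    """Return {epoch: mean loss} so the trend is visible at a glance."""
    losses = defaultdict(list)
    for epoch, loss in PATTERN.findall(log_text):
        losses[int(epoch)].append(float(loss))
    return {e: sum(v) / len(v) for e, v in sorted(losses.items())}
```

Applied to the excerpt above, epoch 1 averages to roughly 0.0252 (matching the logged "best loss -> 0.0252..."), while later epochs hover around 0.1-0.4 with no downward trend, which is what I mean by "unable to converge".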

Environment (please complete the following information):

  • OS: Ubuntu 22.04
  • Python version: 3.11
  • MONAI version: 1.4.0
  • CUDA/cuDNN version: 12.3
  • GPU models and configuration: RTX 4090 and A100 (80 GB)
