Description
2022-08-10 22:49:58,911 - edit - INFO - training gpus num: 2
2022-08-10 22:49:58,912 - edit - INFO - init distributed process group 0 / 2
2022-08-10 22:49:58,915 - edit - INFO - init distributed process group 1 / 2
2022-08-10 23:16:08,691 - rank0_edit - INFO - SRManyToManyDataset dataset load ok, mode: train len:24000
2022-08-10 23:16:08,691 - rank0_edit - INFO - use repeatdataset, repeat times: 1
2022-08-10 23:16:08,694 - rank0_edit - INFO - model: BasicVSR_v5 's total parameter nums: 23371
2022-08-10 23:16:08,698 - rank0_edit - INFO - syncing the model's parameters...
2022-08-10 23:16:08,991 - rank0_edit - INFO - SRManyToManyDataset dataset load ok, mode: eval len:3000
2022-08-10 23:16:08,992 - rank0_edit - INFO - 1500 iters for one epoch, trained iters: 0, total iters: 600000
2022-08-10 23:16:08,992 - rank0_edit - INFO - Start running, work_dir: ./workdirs/mai_training/20220810_224958, workflow: train, max epochs : 400
2022-08-10 23:16:08,992 - rank0_edit - INFO - registered hooks: [<edit.core.hook.logger.text.TextLoggerHook object at 0x7f89e5a0a290>, <edit.core.hook.checkpoint.checkpoint.CheckpointHook object at 0x7f89e5a0a2d0>, <edit.core.hook.evaluation.eval_hooks.EvalIterHook object at 0x7f89e7f49750>]
2022-08-10 23:16:25,548 - rank0_edit - INFO - epoch: 0, losses: [0.00301], losses_ma: [0.00301], iter: 4
2022-08-10 23:16:34,997 - rank0_edit - INFO - epoch: 0, losses: [0.00328], losses_ma: [0.00314], iter: 9
2022-08-10 23:16:44,411 - rank0_edit - INFO - epoch: 0, losses: [0.00399], losses_ma: [0.00343], iter: 14
2022-08-10 23:16:52,555 - rank0_edit - INFO - epoch: 0, losses: [0.00506], losses_ma: [0.00383], iter: 19
2022-08-10 23:17:02,273 - rank0_edit - INFO - epoch: 0, losses: [0.00280], losses_ma: [0.00363], iter: 24
2022-08-10 23:17:10,873 - rank0_edit - INFO - epoch: 0, losses: [0.00386], losses_ma: [0.00367], iter: 29
2022-08-10 23:17:19,736 - rank0_edit - INFO - epoch: 0, losses: [0.00313], losses_ma: [0.00359], iter: 34
2022-08-10 23:17:28,804 - rank0_edit - INFO - epoch: 0, losses: [0.00358], losses_ma: [0.00359], iter: 39
2022-08-10 23:17:38,345 - rank0_edit - INFO - epoch: 0, losses: [0.00376], losses_ma: [0.00361], iter: 44
2022-08-10 23:17:46,827 - rank0_edit - INFO - epoch: 0, losses: [0.00311], losses_ma: [0.00356], iter: 49
2022-08-10 23:17:55,623 - rank0_edit - INFO - epoch: 0, losses: [0.00406], losses_ma: [0.00360], iter: 54
2022-08-10 23:18:04,857 - rank0_edit - INFO - epoch: 0, losses: [0.00326], losses_ma: [0.00357], iter: 59
2022-08-10 23:18:14,276 - rank0_edit - INFO - epoch: 0, losses: [0.00368], losses_ma: [0.00358], iter: 64
2022-08-10 23:18:23,079 - rank0_edit - INFO - epoch: 0, losses: [0.00392], losses_ma: [0.00361], iter: 69
2022-08-10 23:18:31,822 - rank0_edit - INFO - epoch: 0, losses: [0.00393], losses_ma: [0.00363], iter: 74
2022-08-10 23:18:40,337 - rank0_edit - INFO - epoch: 0, losses: [0.00429], losses_ma: [0.00367], iter: 79
2022-08-10 23:18:50,203 - rank0_edit - INFO - epoch: 0, losses: [0.00378], losses_ma: [0.00368], iter: 84
2022-08-10 23:18:58,560 - rank0_edit - INFO - epoch: 0, losses: [0.00336], losses_ma: [0.00366], iter: 89
2022-08-10 23:19:08,135 - rank0_edit - INFO - epoch: 0, losses: [0.00406], losses_ma: [0.00368], iter: 94
2022-08-10 23:19:16,653 - rank0_edit - INFO - epoch: 0, losses: [0.00396], losses_ma: [0.00369], iter: 99
2022-08-10 23:19:26,492 - rank0_edit - INFO - epoch: 0, losses: [0.00315], losses_ma: [0.00367], iter: 104
2022-08-10 23:19:34,738 - rank0_edit - INFO - epoch: 0, losses: [0.00389], losses_ma: [0.00368], iter: 109
2022-08-10 23:19:43,368 - rank0_edit - INFO - epoch: 0, losses: [0.00297], losses_ma: [0.00365], iter: 114
2022-08-10 23:19:52,057 - rank0_edit - INFO - epoch: 0, losses: [0.00436], losses_ma: [0.00368], iter: 119
2022-08-10 23:20:02,484 - rank0_edit - INFO - epoch: 0, losses: [0.00338], losses_ma: [0.00367], iter: 124
2022-08-10 23:20:11,059 - rank0_edit - INFO - epoch: 0, losses: [0.00386], losses_ma: [0.00367], iter: 129
2022-08-10 23:20:20,270 - rank0_edit - INFO - epoch: 0, losses: [651753408617431171072.00000], losses_ma: [24139015133978931200.00000], iter: 134
2022-08-10 23:20:28,060 - rank0_edit - INFO - epoch: 0, losses: [nan], losses_ma: [nan], iter: 139
The loss starts low (around 0.003 to 0.004), then jumps to about 6.5e20 at iter 134 and is nan by iter 139.
For training I used:
python tools/train.py configs/restorers/BasicVSR/mai.py --gpuids 0,1 -d
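
To pin down where the divergence in the log above begins, a small script can scan the log for the first loss that is non-finite or implausibly large. This is a minimal sketch and not part of the repository: the function name `first_divergence`, the regular expression (written only from the log lines shown in this issue), and the threshold `limit=1.0` are all assumptions for illustration.

```python
import math
import re

# Assumed pattern, derived from the log lines in this issue, e.g.
# "... INFO - epoch: 0, losses: [0.00386], losses_ma: [0.00367], iter: 129"
LINE_RE = re.compile(
    r"epoch: (\d+), losses: \[([^\]]+)\], losses_ma: \[[^\]]+\], iter: (\d+)"
)


def first_divergence(lines, limit=1.0):
    """Return (epoch, iter, losses) for the first logged entry whose loss is
    non-finite or larger in magnitude than `limit`; return None if none is.

    `limit` is a hypothetical threshold; pick one well above the normal range.
    """
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a loss line (startup messages, hooks, etc.)
        epoch = int(match.group(1))
        iteration = int(match.group(3))
        losses = [float(value) for value in match.group(2).split(",")]
        if any(not math.isfinite(v) or abs(v) > limit for v in losses):
            return epoch, iteration, losses
    return None


# Three lines copied from the log in this issue.
sample = [
    "2022-08-10 23:20:11,059 - rank0_edit - INFO - epoch: 0, losses: [0.00386], losses_ma: [0.00367], iter: 129",
    "2022-08-10 23:20:20,270 - rank0_edit - INFO - epoch: 0, losses: [651753408617431171072.00000], losses_ma: [24139015133978931200.00000], iter: 134",
    "2022-08-10 23:20:28,060 - rank0_edit - INFO - epoch: 0, losses: [nan], losses_ma: [nan], iter: 139",
]

if __name__ == "__main__":
    print(first_divergence(sample))
```

On the three sample lines this reports epoch 0, iter 134, which is the entry just before the loss turns into nan. Since the log only prints every 5 iterations, the actual blow-up may have started anywhere between iter 130 and iter 134.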