Inconsistent Training Throughput Across Epochs #2449
Unanswered
caojiaolong asked this question in Q&A

I attempted to reproduce MobileNetV4 using this [configuration](https://gist.github.com/rwightman/f6705cb65c03daeebca8aa129b1b94ad#file-mnv4_hm_r384_e550_ix_gpu8-yaml) on ImageNet-1K with 8×RTX 3090 GPUs and 100 CPU cores. However, I noticed that certain epochs and iterations have significantly lower throughput than others. Is this behavior expected?

In the screenshots below, the throughput during epochs 114 and 115 is noticeably lower than in epoch 116 (around 2K images per second instead of the usual 4K). This slowdown also occurs at random in other epochs.

My training script:

Has anyone encountered similar issues or found a potential cause?
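One way to narrow down where such slowdowns come from is to time the data-loading and compute phases of each training iteration separately: if the data-loading share spikes during the slow epochs, the input pipeline is the bottleneck rather than the GPUs. The sketch below is a generic PyTorch-style diagnostic, not part of the original post or the linked configuration; `model`, `loader`, `optimizer`, `loss_fn`, and `log_every` are placeholders.

```python
import time

import torch


def profile_epoch(model, loader, optimizer, loss_fn, device, log_every=50):
    """Report img/s per logging window, split into data-loading vs. compute time.

    All arguments are placeholders for a standard PyTorch training setup; none
    of them come from the training script discussed in this thread.
    """
    model.train()
    data_time = compute_time = 0.0
    seen = 0
    end = time.perf_counter()

    for step, (images, targets) in enumerate(loader, start=1):
        t_data = time.perf_counter()
        data_time += t_data - end  # time spent waiting on the DataLoader

        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()
        if device.type == "cuda":
            torch.cuda.synchronize(device)  # so compute time reflects real GPU work

        now = time.perf_counter()
        compute_time += now - t_data  # forward/backward/optimizer-step time
        seen += images.size(0)

        if step % log_every == 0:
            total = data_time + compute_time
            print(f"step {step}: {seen / total:,.0f} img/s "
                  f"(data {data_time / total:.0%}, compute {compute_time / total:.0%})")
            data_time = compute_time = 0.0
            seen = 0
        end = now
```

If the data fraction jumps during the slow epochs, likely culprits are disk contention, page-cache pressure, or starved dataloader workers shared across the 8 ranks; if the compute fraction jumps, GPU thermal throttling or other jobs on the cards are more plausible.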
Replies: 2 comments

- I also plotted the average speed (epochs per minute) and noticed that it gradually decreases as training progresses, which is a very strange phenomenon.

- It appears that the training speed fluctuates: it starts fast, then slows down, and then speeds up again.
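To quantify the gradual slowdown and the fluctuation described in the replies above, it can also help to record each epoch's wall-clock time and derived throughput to a file for plotting. The following is a minimal, hypothetical sketch; `run_one_epoch`, `num_images`, and the output path are assumptions, not part of the original setup.

```python
import csv
import time


def track_epoch_throughput(run_one_epoch, num_epochs, num_images,
                           out_csv="epoch_throughput.csv"):
    """Log per-epoch duration and throughput so slow epochs stand out.

    `run_one_epoch` is a placeholder callable that trains one epoch, and
    `num_images` is the dataset size; neither comes from the original thread.
    """
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["epoch", "minutes", "images_per_sec", "epochs_per_min"])
        for epoch in range(num_epochs):
            start = time.perf_counter()
            run_one_epoch(epoch)
            elapsed = time.perf_counter() - start
            writer.writerow([
                epoch,
                round(elapsed / 60, 2),
                round(num_images / elapsed, 1),
                round(60 / elapsed, 4),
            ])
            f.flush()  # keep the CSV readable while training is still running
```

Plotting `images_per_sec` (or `epochs_per_min`) against the epoch index makes it easier to tell whether the drop is gradual, periodic, or tied to specific epochs.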