Inconsistent Training Throughput Across Epochs #2449
Unanswered
caojiaolong asked this question in Q&A

I attempted to reproduce MobileNetV4 using this [configuration](https://gist.github.com/rwightman/f6705cb65c03daeebca8aa129b1b94ad#file-mnv4_hm_r384_e550_ix_gpu8-yaml) on ImageNet-1K with 8×RTX 3090 GPUs and 100 CPU cores. However, I noticed that certain epochs and iterations have significantly lower throughput than others. Is this behavior expected?

In the screenshots below, the throughput during epochs 114 and 115 is noticeably lower than in epoch 116 (around 2K images per second instead of the usual 4K). This slowdown also occurs at random in other epochs.

My training script:

Has anyone encountered similar issues or found a potential cause?
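One way to narrow down where such slowdowns come from is to time the data-loading and compute phases of each training iteration separately: if the data-loading share spikes during the slow epochs, the input pipeline is the bottleneck rather than the GPUs. The sketch below is a generic PyTorch-style diagnostic, not part of the original post or the linked configuration; `model`, `loader`, `optimizer`, `loss_fn`, and `log_every` are placeholders.

```python
import time

import torch


def profile_epoch(model, loader, optimizer, loss_fn, device, log_every=50):
    """Report img/s per logging window, split into data-loading vs. compute time.

    All arguments are placeholders for a standard PyTorch training setup; none
    of them come from the training script discussed in this thread.
    """
    model.train()
    data_time = compute_time = 0.0
    seen = 0
    end = time.perf_counter()

    for step, (images, targets) in enumerate(loader, start=1):
        t_data = time.perf_counter()
        data_time += t_data - end  # time spent waiting on the DataLoader

        images = images.to(device, non_blocking=True)
        targets = targets.to(device, non_blocking=True)
        optimizer.zero_grad(set_to_none=True)
        loss = loss_fn(model(images), targets)
        loss.backward()
        optimizer.step()
        if device.type == "cuda":
            torch.cuda.synchronize(device)  # so compute time reflects real GPU work

        now = time.perf_counter()
        compute_time += now - t_data  # forward/backward/optimizer-step time
        seen += images.size(0)

        if step % log_every == 0:
            total = data_time + compute_time
            print(f"step {step}: {seen / total:,.0f} img/s "
                  f"(data {data_time / total:.0%}, compute {compute_time / total:.0%})")
            data_time = compute_time = 0.0
            seen = 0
        end = now
```

If the data fraction jumps during the slow epochs, likely culprits are disk contention, page-cache pressure, or starved dataloader workers shared across the 8 ranks; if the compute fraction jumps, GPU thermal throttling or other jobs on the cards are more plausible.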
Replies: 2 comments

- I also plotted the average speed (epochs per minute) and noticed that it gradually decreases as training progresses, which is a very strange phenomenon.

- It appears that the training speed fluctuates: it starts fast, then slows down, and then speeds up again.
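To quantify the gradual slowdown and the fluctuation described in the replies above, it can also help to record each epoch's wall-clock time and derived throughput to a file for plotting. The following is a minimal, hypothetical sketch; `run_one_epoch`, `num_images`, and the output path are assumptions, not part of the original setup.

```python
import csv
import time


def track_epoch_throughput(run_one_epoch, num_epochs, num_images,
                           out_csv="epoch_throughput.csv"):
    """Log per-epoch duration and throughput so slow epochs stand out.

    `run_one_epoch` is a placeholder callable that trains one epoch, and
    `num_images` is the dataset size; neither comes from the original thread.
    """
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["epoch", "minutes", "images_per_sec", "epochs_per_min"])
        for epoch in range(num_epochs):
            start = time.perf_counter()
            run_one_epoch(epoch)
            elapsed = time.perf_counter() - start
            writer.writerow([
                epoch,
                round(elapsed / 60, 2),
                round(num_images / elapsed, 1),
                round(60 / elapsed, 4),
            ])
            f.flush()  # keep the CSV readable while training is still running
```

Plotting `images_per_sec` (or `epochs_per_min`) against the epoch index makes it easier to tell whether the drop is gradual, periodic, or tied to specific epochs.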