Description
When using the SwinTransformer backbone in MMDetection and setting frozen_stages >= 0, I encounter the following error during distributed training:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss.
If I set frozen_stages=-1 (no frozen stages), the training proceeds without any issues. However, freezing any stage (e.g., frozen_stages=1) consistently triggers this error.
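To narrow down which parameters the error is complaining about, a check along the following lines might help. It is only a minimal sketch in plain PyTorch: `list_unused_parameters` and the `Toy` module are hypothetical names made up for illustration, not part of MMDetection, and the toy model just mimics a layer that stays trainable (`requires_grad=True`) but never contributes to the loss. With the real detector, the model and loss would come from a single non-distributed training step.

```python
import torch
from torch import nn


def list_unused_parameters(model: nn.Module, loss: torch.Tensor) -> list:
    """Return names of trainable parameters that got no gradient from `loss`."""
    model.zero_grad(set_to_none=True)  # make sure stale gradients do not hide anything
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


class Toy(nn.Module):
    """Hypothetical stand-in: one layer is used, one is trainable but never called."""

    def __init__(self):
        super().__init__()
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)  # never used in forward()

    def forward(self, x):
        return self.used(x)


model = Toy()
loss = model(torch.randn(3, 4)).sum()
unused = list_unused_parameters(model, loss)
print(unused)
```

Any name this prints for the real model would be a parameter that DDP waits on during gradient reduction, which matches the wording of the error above.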
Reproduction Steps
Use the following configuration file:
model = dict(
    backbone=dict(
        type='SwinTransformer',
        embed_dims=96,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        out_indices=(1, 2, 3),
        frozen_stages=1,  # Freezing the first stage
        init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
    neck=dict(in_channels=[192, 384, 768], start_level=0, num_outs=5))
Run the training command:
CUDA_VISIBLE_DEVICES=2,3 bash ./tools/dist_train.sh ./configs/swin/retinanet_swin-t-p4-w7_fpn_1x_coco.py 2
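A possible workaround, which the full PyTorch error message also hints at, is to enable unused-parameter detection in DistributedDataParallel. As far as I know, MMDetection reads a top-level `find_unused_parameters` key from the config and forwards it to the DDP wrapper, but that is an assumption worth verifying against the installed version. The fragment below is a sketch of that setting, not a confirmed fix:

```python
# Assumed top-level config key (placed outside the `model` dict).
# It is expected to be passed on to torch.nn.parallel.DistributedDataParallel.
find_unused_parameters = True
```

This would likely hide the symptom rather than explain why freezing a stage leaves trainable parameters out of the loss, and it adds some overhead per iteration.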