Issue: RuntimeError with SwinTransformer frozen_stages >= 0 in Distributed Training #12345

Open
@941227056

Description

When I use the SwinTransformer backbone in MMDetection with frozen_stages >= 0, distributed training fails with the following error:
RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss.
If I set frozen_stages=-1 (no frozen stages), the training proceeds without any issues. However, freezing any stage (e.g., frozen_stages=1) consistently triggers this error.
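
To narrow down which parameters the reducer is waiting on, one option is to run a single forward/backward pass in a non-distributed process and list every parameter that still has requires_grad=True but ends up with no gradient. The sketch below uses plain PyTorch only; `find_params_without_grad` and `ToyNet` are hypothetical names made up for illustration and are not part of MMDetection. With the real detector, the loss would come from the model's own training forward pass.

```python
from typing import List

import torch
from torch import nn


def find_params_without_grad(model: nn.Module, loss: torch.Tensor) -> List[str]:
    """Return names of trainable parameters that got no gradient from `loss`."""
    model.zero_grad(set_to_none=True)  # clear any stale gradients first
    loss.backward()
    return [
        name
        for name, param in model.named_parameters()
        if param.requires_grad and param.grad is None
    ]


class ToyNet(nn.Module):
    """Hypothetical stand-in model: `unused` never contributes to the output."""

    def __init__(self) -> None:
        super().__init__()
        self.frozen = nn.Linear(4, 4)
        self.used = nn.Linear(4, 2)
        self.unused = nn.Linear(4, 2)
        # Freeze one layer, analogous to freezing a backbone stage.
        for p in self.frozen.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.used(self.frozen(x))


toy = ToyNet()
missing = find_params_without_grad(toy, toy(torch.randn(3, 4)).sum())
print(missing)  # ['unused.weight', 'unused.bias']
```

Frozen parameters are skipped on purpose: only parameters that are still marked trainable but receive no gradient are reported, which is the condition the error message describes.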

Reproduction Steps
Use the following configuration file:
model = dict(
    backbone=dict(
        type='SwinTransformer',
        embed_dims=96,
        depths=[2, 2, 6, 2],
        num_heads=[3, 6, 12, 24],
        window_size=7,
        out_indices=(1, 2, 3),
        frozen_stages=1,  # Freezing the first stage
        init_cfg=dict(type='Pretrained', checkpoint=pretrained)),
    neck=dict(in_channels=[192, 384, 768], start_level=0, num_outs=5))
Run the training command:
CUDA_VISIBLE_DEVICES=2,3 bash ./tools/dist_train.sh ./configs/swin/retinanet_swin-t-p4-w7_fpn_1x_coco.py 2
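
A common workaround for this class of error is to enable unused-parameter detection in torch.nn.parallel.DistributedDataParallel through its find_unused_parameters argument. The fragment below assumes the MMDetection training entry point reads a top-level find_unused_parameters key from the config and forwards it to the DDP wrapper. That key is an assumption about the installed version, and it has not been confirmed that it resolves this particular case.

```python
# Assumption: a top-level config key that the training script passes on to
# DistributedDataParallel(find_unused_parameters=...). Enabling it adds an
# extra traversal of the autograd graph on every iteration.
find_unused_parameters = True
```

If training runs with this setting, that points to parameters the reducer expects but that never receive gradients, rather than a problem with the data or the loss.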
