
Segmentation DDP Failure multiple trials #698

@Aasim-ComputerVision

Description


Search before asking

  • I have searched the RF-DETR issues and found no similar bug report.

Bug

Training won't start with DDP enabled for rf-detr segmentation
DDP Trials Attempted

Below is every configuration I tried.

Trial 1 - Basic DDP (torch.distributed.launch)

python -m torch.distributed.launch --nproc_per_node=2 --use_env train_seg_1024.py

Result:
RuntimeError: Expected to mark a variable ready only once.
Parameter segmentation_head.bias has been marked as ready twice.

Crash occurs during:
train_one_epoch()
scaler.scale(losses).backward()

Trial 2 - Using torchrun (recommended method)

torchrun --nproc_per_node=2 train_seg_1024_ddp.py

Same error:
segmentation_head.bias marked ready twice

Trial 3 - Adjusted effective batch size

Tried preserving effective batch size:

    GPUs = 2
    batch_size = 4
    grad_accum = 2
    Effective = 16
    
    GPUs = 2
    batch_size = 2
    grad_accum = 4
    Effective = 16

Same crash.
This is not related to gradient accumulation.
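
For reference, the arithmetic behind both configurations (a trivial sketch, assuming the usual convention that the effective batch is per-GPU batch size × number of GPUs × gradient accumulation steps):

def effective_batch_size(num_gpus: int, batch_size: int, grad_accum: int) -> int:
    # per-GPU batch * world size * accumulation steps
    return num_gpus * batch_size * grad_accum

assert effective_batch_size(2, 4, 2) == 16
assert effective_batch_size(2, 2, 4) == 16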

Trial 4 - Disabled evaluation completely

Changed:
run_test=False
fp16_eval=False
early_stopping=False

Still crashes.
So this is not evaluation-related.

Trial 5 - AMP On / Off

Tried both:
amp=True
amp=False

No change.
Still fails at backward pass.

Trial 6 - Reduced num_workers

Tried:
num_workers=6
num_workers=4
num_workers=2

No difference.
This is not a DataLoader issue.

Trial 7 - Explicit local_rank GPU binding

Added:
local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

Still crashes.
So this is not a rank-to-GPU mapping issue.

Trial 8 - Disabled wandb and tensorboard

wandb=False
tensorboard=False

No change.

Trial 9 - Removed fp16_eval

Set:
fp16_eval=False

Still crashes.

Exact Failure
Occurs during training backward under DDP:
RuntimeError: Expected to mark a variable ready only once.
Parameter at index 280 with name segmentation_head.bias has been marked as ready twice.

Observations

Error occurs before evaluation starts.
Always references segmentation_head.bias.
Only happens in segmentation model.
Detection-only model does not show this behavior.
Happens during backward pass inside DDP.

Interpretation

This PyTorch DDP error occurs when:
A parameter participates in multiple backward graphs in one iteration.
Re-entrant backward occurs.
A module parameter is reused outside forward.
DDP expects a static graph but the model graph is dynamic.
Since only segmentation_head.bias triggers this, it strongly suggests that the RF-DETR segmentation head is not DDP-safe in version 1.4.3.

Likely Root Cause
Segmentation adds:
Mask head
Auxiliary mask losses
Multi-branch backward paths
This likely causes multiple autograd hooks to fire on the same parameter in a single iteration (a toy illustration of that pattern is sketched below).
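
To make the suspected mechanism concrete, here is a toy sketch (not rf-detr code; the module is purely hypothetical) of the reentrant-backward pattern PyTorch's error message describes, where the same parameters are checkpointed twice in one forward pass. Under DDP this should trigger the same "marked as ready twice" error, even in a single-process group:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.checkpoint import checkpoint

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.block = nn.Linear(8, 8)

    def forward(self, x):
        # the same block is wrapped in two reentrant checkpoint calls, so its
        # parameters take part in two reentrant backward passes per iteration
        h = checkpoint(self.block, x, use_reentrant=True)
        return checkpoint(self.block, h, use_reentrant=True)

def main():
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29501")
    dist.init_process_group("gloo", rank=0, world_size=1)
    model = DDP(Net())
    out = model(torch.randn(4, 8))
    # expected to raise: "Expected to mark a variable ready only once. ..."
    out.sum().backward()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()

If the segmentation head ends up in an equivalent situation (its parameters contributing to more than one backward path per iteration), that would explain why only segmentation_head.bias is reported.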

PyTorch 2.x supports:

DistributedDataParallel(..., static_graph=True)
or:
model._set_static_graph()

Both assume the graph does not change across iterations.
Currently neither is enabled internally (a usage sketch follows below).
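
For illustration, a minimal sketch of how either form would be applied if the DDP wrapping were done by hand (rf-detr wraps the model internally, so core_model below is just a placeholder module, not the library's actual model; launch with torchrun):

import os
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# torchrun --nproc_per_node=2 this_script.py
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

core_model = nn.Linear(16, 16).cuda(local_rank)  # placeholder for the real model

ddp_model = DDP(
    core_model,
    device_ids=[local_rank],
    static_graph=True,  # tell the reducer the autograd graph is fixed across iterations
)
# Equivalent post-construction form:
# ddp_model._set_static_graph()

dist.destroy_process_group()

Both forms tell DDP that the set of parameters used and the order in which their hooks fire will not change between iterations, which is what lets it handle cases like activation checkpointing, where a parameter's hook fires more than once per iteration.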

Questions
Is RF-DETR Segmentation officially supported under DDP in 1.4.3?
Should static_graph=True be enabled internally for segmentation?
Is there a known workaround for segmentation DDP training?

Environment

RF-DETR Version: 1.4.3
Model: RFDETRSegMedium (Segmentation)
GPUs: 2 × RTX 5090
CUDA: 12.7
OS: Ubuntu 22.04
Launch method: torchrun --nproc_per_node=2

Dataset Details
Train images: 15,325
Val/Test images: 6,640
Training resolution: 432

Minimal Reproducible Example

from rfdetr import RFDETRSegMedium

model = RFDETRSegMedium()

model.train(
    dataset_dir="/workspace/SDMTP/dataset_root",
    output_dir="/workspace/rf-detr/runs_seg_M1_uni_crops_dataset_Medium_ddp/",
    epochs=30,
    batch_size=2,
    grad_accum_steps=4,
    amp=True,
    run_test=False,
    fp16_eval=False,
    eval_max_dets=100,
    num_workers=6,
)

torchrun --nproc_per_node=2 train_seg_1024_ddp.py

Additional

Error occurs early in the first training epoch, before evaluation starts
Always references segmentation_head.bias
Happens in segmentation model
Happens during backward pass inside DDP

Exact error log from one of the runs:

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop. 2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.
[rank0]: Parameter at index 280 with name segmentation_head.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
[rank1]: Traceback (most recent call last):
[rank1]:   File "/workspace/rf-detr/train_seg_ddp.py", line 44, in <module>
[rank1]:     model.train(
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/rfdetr/detr.py", line 97, in train
[rank1]:     self.train_from_config(config, **kwargs)
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/rfdetr/detr.py", line 238, in train_from_config
[rank1]:     self.model.train(
[rank1]:   File "/usr/local/lib/python3.12/dist-packages/rfdetr/main.py", line 360, in train
[rank1]:     train_stats = train_one_epoch(
[rank1]: Parameter at index 280 with name segmentation_head.bias has been marked as ready twice. This means that multiple autograd engine hooks have fired for this particular parameter during this iteration.
[rank0]:[W217 07:36:50.984191104 ProcessGroupNCCL.cpp:1538] Warning: WARNING: destroy_process_group() was not called before program exit, which can leak resources. For more info, please see https://pytorch.org/docs/stable/distributed.html#shutdown (function operator())
W0217 07:36:50.975000 2242 torch/distributed/elastic/multiprocessing/api.py:900] Sending process 2308 closing signal SIGTERM
E0217 07:36:51.341000 2242 torch/distributed/elastic/multiprocessing/api.py:874] failed (exitcode: 1) local_rank: 0 (pid: 2307) of binary: /usr/local/bin/python
Traceback (most recent call last):
  File "/usr/local/bin/torchrun", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 901, in main
    run(args)
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/run.py", line 892, in run
    elastic_launch(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 143, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/launcher/api.py", line 277, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_seg_ddp.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-02-17_07:36:50
  host      : b5381081e05b
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2307)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================

Are you willing to submit a PR?

  • Yes, I'd like to help by submitting a PR!
