torch.distributed.DistBackendError after rank 0 is loading weights. #3443

### Please check that this issue hasn't been reported before.

- [x] I searched previous [Bug Reports](https://github.com/axolotl-ai-cloud/axolotl/labels/bug) and didn't find any similar reports.

### Expected Behavior

I am training the Qwen3 235B model on 4× A100 GPUs with 860 GB of RAM. Rank 0 takes more than 30 minutes to load the weights, so the other ranks have to wait more than 30 minutes.
Setting `ddp_timeout: 7200` should increase the timeout so that they can wait that long.
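To make the expectation concrete, here is a minimal sketch of what I assume the setting should amount to at the PyTorch level. It is not taken from axolotl's code: `build_process_group_kwargs` and `init_distributed` are hypothetical helper names, and the only library call used is `torch.distributed.init_process_group`, which accepts a `timeout` given as a `datetime.timedelta`.

```python
from datetime import timedelta


def build_process_group_kwargs(timeout_seconds: int) -> dict:
    """Hypothetical helper: keyword arguments for init_process_group."""
    return {
        "backend": "nccl",
        "timeout": timedelta(seconds=timeout_seconds),
    }


def init_distributed(timeout_seconds: int) -> None:
    """Initialize the default process group with a custom timeout.

    torch is imported lazily so the helper above can be inspected on a
    machine without PyTorch. This must run under a launcher such as
    torchrun so that the rank and world-size environment variables are set.
    """
    import torch.distributed as dist

    dist.init_process_group(**build_process_group_kwargs(timeout_seconds))


# What I expect ddp_timeout: 7200 to correspond to.
kwargs = build_process_group_kwargs(7200)
print(kwargs["timeout"])  # 2:00:00
```

Whether axolotl forwards `ddp_timeout` to this call before the NCCL communicator is set up is exactly what I am unsure about.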



### Current behaviour

I am training the Qwen3 235B model on 4× A100 GPUs with 860 GB of RAM. Rank 0 takes more than 30 minutes to load the weights, and the other ranks have to wait for it. Because loading takes longer than 30 minutes, the run crashes with:
```
torch.distributed.DistBackendError: [3] is setting up NCCL communicator and retrieving ncclUniqueId from [0] via c10d key-value store by key '0', but store->get('0') got error: wait timeout after 1800000ms, keys: /default_pg/0//cuda//0
[rank3]: Exception raised from doWait at /pytorch/torch/csrc/distributed/c10d/TCPStore.cpp:597 (most recent call first):
```
Why does setting `ddp_timeout: 7200` have no effect? How can I increase the timeout?
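A quick unit check on the numbers, as a sketch: the timeout in the traceback is reported in milliseconds, while I am assuming `ddp_timeout` is given in seconds. If that assumption holds, the value in the error corresponds to 30 minutes and not to the 2 hours I configured, which suggests a default is still in effect.

```python
from datetime import timedelta

# Timeout reported in the traceback: "wait timeout after 1800000ms".
observed = timedelta(milliseconds=1_800_000)

# Timeout requested in the config: ddp_timeout: 7200 (assumed to be seconds).
configured = timedelta(seconds=7200)

print(observed)                # 0:30:00
print(configured)              # 2:00:00
print(observed == configured)  # False
```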

### Steps to reproduce

Setting `ddp_timeout: 7200` should increase the NCCL timeout, but it does not.

### Config yaml

```yaml
base_model: /home/savitha/model/qwen3-235b-thinking-2507
trust_remote_code: true

load_in_4bit: true
adapter: qlora
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: bfloat16

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
chat_template: qwen3

ddp_timeout: 7200          

fsdp_version: 2
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  activation_checkpointing: true
  use_orig_params: true
  offload_params: false                              
  sync_module_states: true
  cpu_ram_efficient_loading: true                    
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3MoeDecoderLayer 
  state_dict_type: SHARDED_STATE_DICT
  reshard_after_forward: true                         
  limit_all_gather_buffer: true

bf16: true                     

gradient_checkpointing: false     
activation_offloading: false
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true

output_dir: ./checkpoints/qwen3-stability
save_total_limit: 2
saves_per_epoch: 2
evals_per_epoch: 4
```

### Possible solution

_No response_

### Which Operating Systems are you using?

- [x] Linux
- [ ] macOS
- [ ] Windows

### Python Version

3.10

### axolotl branch-commit

main

### Acknowledgements

- [x] My issue title is concise, descriptive, and in title casing.
- [x] I have searched the existing issues to make sure this bug has not been reported yet.
- [x] I am using the latest version of axolotl.
- [x] I have provided enough information for the maintainers to reproduce and diagnose the issue.
