torch.distributed.DistBackendError after rank 0 is loading weights. #3443

Description

@savitha-suresh

Please check that this issue hasn't been reported before.

  • I searched previous Bug Reports and didn't find any similar reports.

Expected Behavior

I am training the Qwen3 235B model on 4× A100 GPUs with 860 GB of RAM. Rank 0 takes more than 30 minutes to load the weights, so the other ranks have to wait for more than 30 minutes.
Setting ddp_timeout: 7200 should increase the timeout accordingly.

Current behaviour

I am training the Qwen3 235B model on 4× A100 GPUs with 860 GB of RAM. Rank 0 takes more than 30 minutes to load the weights while the other ranks wait for it. Because loading takes longer than 30 minutes, the waiting ranks time out and the run crashes:

0d key-value store by key '0', but store->get('0') got error: wait timeout after 1800000ms, keys: /default_pg/0//cuda//0    
[rank3]: Exception raised from doWait at /pytorch/torch/csrc/distributed/c10d/TCPStore.cpp:597 (most recent call first):    
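
For reference, the timeout reported in the log can be compared against the configured value. The sketch below only performs the unit conversion; the helper name `ms_to_seconds` is hypothetical. The observed 1800 seconds is not the configured 7200 seconds, which suggests (as an assumption, not a confirmed diagnosis) that a default timeout was still in effect when the store wait happened.

```python
# Values copied from the report: the log message and the config yaml.
LOG_TIMEOUT_MS = 1_800_000    # "wait timeout after 1800000ms"
CONFIGURED_TIMEOUT_S = 7200   # ddp_timeout: 7200


def ms_to_seconds(ms: int) -> float:
    """Convert milliseconds to seconds (hypothetical helper for illustration)."""
    return ms / 1000


observed_s = ms_to_seconds(LOG_TIMEOUT_MS)

print(f"observed timeout:   {observed_s:.0f} s ({observed_s / 60:.0f} min)")
print(f"configured timeout: {CONFIGURED_TIMEOUT_S} s ({CONFIGURED_TIMEOUT_S / 60:.0f} min)")
print("configured value in effect:", observed_s == CONFIGURED_TIMEOUT_S)
```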

Why does setting ddp_timeout: 7200 have no effect? How can I increase the timeout?

Steps to reproduce

Setting ddp_timeout to 7200 should increase the NCCL timeout, but it does not: the wait still times out after 1800000 ms (30 minutes).
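
At the PyTorch level, the process group timeout is the `timeout` argument of `torch.distributed.init_process_group`, which takes a `datetime.timedelta` and is set when the group is created. The sketch below shows that call in isolation. It is not axolotl's code path, and the helper names `pg_timeout` and `init_with_timeout` are hypothetical. It assumes a launch via torchrun, so that `RANK`, `WORLD_SIZE`, `MASTER_ADDR` and `MASTER_PORT` are present in the environment.

```python
from datetime import timedelta


def pg_timeout(seconds: int) -> timedelta:
    """Convert a timeout in seconds (the unit ddp_timeout uses) to a timedelta."""
    if seconds <= 0:
        raise ValueError("timeout must be a positive number of seconds")
    return timedelta(seconds=seconds)


def init_with_timeout(seconds: int = 7200, backend: str = "nccl") -> None:
    """Create the default process group with an explicit timeout.

    Hypothetical helper: assumes the torchrun environment variables are set.
    """
    # Imported lazily so pg_timeout() can be used without torch installed.
    import torch.distributed as dist

    # The timeout is applied when the group is created, so it has to be
    # passed here rather than set afterwards.
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, timeout=pg_timeout(seconds))


if __name__ == "__main__":
    print(pg_timeout(7200))
```

If the process group has already been created with a default timeout before the configured value is read, a later setting would not change it. Whether that is what happens in this setup is not confirmed.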

Config yaml

base_model: /home/savitha/model/qwen3-235b-thinking-2507
trust_remote_code: true

load_in_4bit: true
adapter: qlora
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: bfloat16

lora_r: 16
lora_alpha: 32
lora_dropout: 0.05

lora_target_modules:
  - q_proj
  - k_proj
  - v_proj
  - o_proj
chat_template: qwen3

ddp_timeout: 7200          

fsdp_version: 2
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  activation_checkpointing: true
  use_orig_params: true
  offload_params: false                              
  sync_module_states: true
  cpu_ram_efficient_loading: true                    
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3MoeDecoderLayer 
  state_dict_type: SHARDED_STATE_DICT
  reshard_after_forward: true                         
  limit_all_gather_buffer: true

bf16: true                     

gradient_checkpointing: false     
activation_offloading: false
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true

output_dir: ./checkpoints/qwen3-stability
save_total_limit: 2
saves_per_epoch: 2
evals_per_epoch: 4

Possible solution

No response

Which Operating Systems are you using?

  • Linux
  • macOS
  • Windows

Python Version

3.10

axolotl branch-commit

main

Acknowledgements

  • My issue title is concise, descriptive, and in title casing.
  • I have searched the existing issues to make sure this bug has not been reported yet.
  • I am using the latest version of axolotl.
  • I have provided enough information for the maintainers to reproduce and diagnose the issue.

Labels

bug (Something isn't working)
