Please check that this issue hasn't been reported before.
Expected Behavior
I am training the Qwen3 235B model on 4 A100 GPUs with 860 GB of RAM. Rank 0 takes more than 30 minutes to load the weights, so the other ranks have to wait more than 30 minutes.
Setting ddp_timeout: 7200 should increase the timeout.
Current behaviour
I am training the Qwen3 235B model on 4 A100 GPUs with 860 GB of RAM. Rank 0 takes more than 30 minutes to load the weights, and the other ranks time out while waiting for it, so the run crashes with:
0d key-value store by key '0', but store->get('0') got error: wait timeout after 1800000ms, keys: /default_pg/0//cuda//0
[rank3]: Exception raised from doWait at /pytorch/torch/csrc/distributed/c10d/TCPStore.cpp:597 (most recent call first):
Why does setting ddp_timeout: 7200 have no effect? How can I increase the timeout?
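As a sanity check on the numbers, here is a minimal sketch that compares the timeout reported in the error with the value set in the config. The helper name `observed_timeout` is hypothetical and the log text is copied from the truncated error line in this report; it only does the unit conversion and does not touch the training setup.

```python
import re
from datetime import timedelta

# Error text copied from the crash log in this report (truncated as reported).
ERROR_LINE = (
    "store->get('0') got error: wait timeout after 1800000ms, "
    "keys: /default_pg/0//cuda//0"
)

# Value of ddp_timeout in the config, in seconds.
CONFIGURED_DDP_TIMEOUT_S = 7200


def observed_timeout(log_line: str) -> timedelta:
    """Extract the 'wait timeout after <N>ms' value from a TCPStore error line."""
    match = re.search(r"wait timeout after (\d+)ms", log_line)
    if match is None:
        raise ValueError("no timeout value found in log line")
    return timedelta(milliseconds=int(match.group(1)))


observed = observed_timeout(ERROR_LINE)
configured = timedelta(seconds=CONFIGURED_DDP_TIMEOUT_S)

print(f"observed:   {observed}")
print(f"configured: {configured}")
print(f"config applied: {observed == configured}")
```

The wait in the log expires after 30 minutes, while the config asks for 2 hours, so the 7200 value does not appear to reach the store that raised this error.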
Steps to reproduce
Set ddp_timeout to 7200 in the config below. This should increase the NCCL timeout, but it does not.
Config yaml
base_model: /home/savitha/model/qwen3-235b-thinking-2507
trust_remote_code: true
load_in_4bit: true
adapter: qlora
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: bfloat16
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
- q_proj
- k_proj
- v_proj
- o_proj
chat_template: qwen3
ddp_timeout: 7200
fsdp_version: 2
fsdp:
- full_shard
- auto_wrap
fsdp_config:
  activation_checkpointing: true
  use_orig_params: true
  offload_params: false
  sync_module_states: true
  cpu_ram_efficient_loading: true
  auto_wrap_policy: TRANSFORMER_BASED_WRAP
  transformer_layer_cls_to_wrap: Qwen3MoeDecoderLayer
  state_dict_type: SHARDED_STATE_DICT
  reshard_after_forward: true
  limit_all_gather_buffer: true
bf16: true
gradient_checkpointing: false
activation_offloading: false
gradient_checkpointing_kwargs:
  use_reentrant: false
flash_attention: true
output_dir: ./checkpoints/qwen3-stability
save_total_limit: 2
saves_per_epoch: 2
evals_per_epoch: 4
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main
Acknowledgements