Description
Hi, when I set `fsdp_reshard_after_forward: False`, training speed increased by approximately 5-7% (tokens_per_second_per_gpu). Are there any other configurations that affect performance? Or is there a recommended reference for tuning these configurations?
In addition, the `gradient_accumulation_steps` setting does not affect speed. Generally speaking, a larger value should reduce communication frequency and speed up training. The model used in the experiment is Qwen 2.5 3B.
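To illustrate the expectation stated above, here is a minimal sketch (not from any particular framework; the function name and loop are hypothetical) of why a larger accumulation value should, in principle, reduce how often gradients are synchronized across GPUs: gradients are accumulated locally per micro-batch and only communicated once per optimizer step.

```python
# Hypothetical sketch: with gradient accumulation, a gradient
# sync (e.g. all-reduce / reduce-scatter) would be needed only
# once per optimizer step rather than once per micro-batch.
def count_gradient_syncs(num_micro_batches, gradient_accumulation_steps):
    syncs = 0
    for step in range(1, num_micro_batches + 1):
        # backward() accumulates local gradients each micro-batch
        if step % gradient_accumulation_steps == 0:
            syncs += 1  # sync gradients, then optimizer.step()
    return syncs

print(count_gradient_syncs(64, 1))  # syncs every micro-batch
print(count_gradient_syncs(64, 8))  # syncs 8x less often
```

Whether this saving is actually realized depends on the implementation: under FSDP, gradients may still be reduce-scattered on every backward pass unless the framework explicitly skips synchronization for the intermediate micro-batches, which could explain why no speedup was observed.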