Description
Hi, when I set `fsdp_reshard_after_forward: False`, training speed increased by approximately 5-7% (tokens_per_second_per_gpu). Are there any other configurations that affect performance? Or is there a recommended reference for tuning these configurations?
In addition, the `gradient_accumulation_steps` setting does not affect speed. Generally speaking, a larger value should reduce communication frequency and speed up training. The model used in the experiment is Qwen 2.5 3B.
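To illustrate the expectation stated above, here is a minimal sketch (not from any particular framework; the function name and loop are hypothetical) of why a larger accumulation value should, in principle, reduce how often gradients are synchronized across GPUs: gradients are accumulated locally per micro-batch and only communicated once per optimizer step.

```python
# Hypothetical sketch: with gradient accumulation, a gradient
# sync (e.g. all-reduce / reduce-scatter) would be needed only
# once per optimizer step rather than once per micro-batch.
def count_gradient_syncs(num_micro_batches, gradient_accumulation_steps):
    syncs = 0
    for step in range(1, num_micro_batches + 1):
        # backward() accumulates local gradients each micro-batch
        if step % gradient_accumulation_steps == 0:
            syncs += 1  # sync gradients, then optimizer.step()
    return syncs

print(count_gradient_syncs(64, 1))  # syncs every micro-batch
print(count_gradient_syncs(64, 8))  # syncs 8x less often
```

Whether this saving is actually realized depends on the implementation: under FSDP, gradients may still be reduce-scattered on every backward pass unless the framework explicitly skips synchronization for the intermediate micro-batches, which could explain why no speedup was observed.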