Does FSDP v2 have the best performance? #2402

Open
@dz1iang

Description

Hi, when I set fsdp_reshard_after_forward: False, training throughput (tokens_per_second_per_gpu) increased by approximately 5-7%. Are there any other configuration options that affect performance? Is there a reference you recommend for tuning them?
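For context on why this setting can help, here is a rough sketch that counts parameter all-gathers for one forward and backward pass. It assumes every wrapped FSDP unit is fully sharded and ignores special cases such as a root module that is kept unsharded. The function name and the count of 36 units are hypothetical, chosen only for illustration.

```python
def param_all_gathers_per_micro_batch(num_fsdp_units: int,
                                      reshard_after_forward: bool) -> int:
    """Count parameter all-gathers for one forward + backward pass.

    Simplified model (assumption): each FSDP unit is all-gathered once in
    forward, and once more in backward only if it was resharded in between.
    """
    forward = num_fsdp_units
    # If parameters are freed after forward, backward must gather them again.
    backward = num_fsdp_units if reshard_after_forward else 0
    return forward + backward


if __name__ == "__main__":
    units = 36  # hypothetical number of wrapped units
    print(param_all_gathers_per_micro_batch(units, reshard_after_forward=True))   # 72
    print(param_all_gathers_per_micro_batch(units, reshard_after_forward=False))  # 36
```

The saved all-gathers come at the cost of keeping unsharded parameters in memory between forward and backward, so peak memory per GPU rises.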

In addition, changing gradient_accumulation_steps does not affect the speed. Generally, a larger value should reduce the frequency of communication and speed up training, so this is unexpected. The model used in the experiment is Qwen 2.5 3B.
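One possible explanation, sketched below under stated assumptions: with sharded gradients, the gradient reduce-scatter typically runs on every micro-batch unless the training stack explicitly defers it, so the communication per micro-batch stays constant regardless of gradient_accumulation_steps. Whether the framework used here defers gradient sync is an assumption to verify. The function names and the `sync_every_micro_batch` flag are hypothetical.

```python
def grad_reduce_scatters_per_optimizer_step(num_fsdp_units: int,
                                            grad_accum_steps: int,
                                            sync_every_micro_batch: bool = True) -> int:
    """Count gradient reduce-scatters issued for one optimizer step."""
    if sync_every_micro_batch:
        # Gradients are reduced after every backward pass.
        return num_fsdp_units * grad_accum_steps
    # Deferred sync: gradients are reduced only on the last micro-batch.
    return num_fsdp_units


def grad_reduce_scatters_per_micro_batch(num_fsdp_units: int,
                                         grad_accum_steps: int,
                                         sync_every_micro_batch: bool = True) -> float:
    """Average reduce-scatters per micro-batch, a proxy for cost per token."""
    per_step = grad_reduce_scatters_per_optimizer_step(
        num_fsdp_units, grad_accum_steps, sync_every_micro_batch)
    return per_step / grad_accum_steps


if __name__ == "__main__":
    units = 36  # hypothetical number of wrapped units
    for steps in (1, 4, 8):
        every = grad_reduce_scatters_per_micro_batch(units, steps, True)
        deferred = grad_reduce_scatters_per_micro_batch(units, steps, False)
        print(steps, every, deferred)
```

In this model the per-micro-batch cost only drops when sync is deferred, and even then the parameter all-gathers still run on every micro-batch because the parameters remain sharded. Deferring the sync also means holding unreduced gradients between micro-batches, which increases memory use.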
