
Does torchrun + FSDP create multiple copies of the same dataset and model? #1289

Open
@tsengalb99

Description

In the example T5 training code, the main function constructs the model and the dataset on every worker, regardless of rank, before passing the model to FSDP. Does this mean that running the script with torchrun and n processes creates n copies of the model and dataset?
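
To make the pattern concrete, here is a minimal sketch in plain Python. It is not the example's actual code: `build_model`, `build_dataset`, and `run_rank` are hypothetical stand-ins, and the strided index split is only an assumption about how a distributed sampler might divide the data. The one thing taken from torchrun itself is that each launched process runs the script independently and sees `RANK` and `WORLD_SIZE` in its environment.

```python
import os


def build_model():
    # Hypothetical stand-in for constructing the T5 model.
    # Each call returns a brand-new object.
    return {"weights": [0.0] * 8}


def build_dataset():
    # Hypothetical stand-in for loading the dataset.
    return list(range(10))


def run_rank(rank, world_size):
    # Every process executes this body in full, so the model and the
    # dataset are constructed once per process, whatever the rank is.
    model = build_model()
    dataset = build_dataset()
    # Assumption: a strided split, so each rank iterates over a
    # different subset of indices even though it holds the whole dataset.
    local_indices = list(range(rank, len(dataset), world_size))
    return model, dataset, local_indices


def main():
    # torchrun sets RANK and WORLD_SIZE for each process it launches;
    # the defaults let the script also run as a single process.
    rank = int(os.environ.get("RANK", 0))
    world_size = int(os.environ.get("WORLD_SIZE", 1))
    _, dataset, local_indices = run_rank(rank, world_size)
    print(f"rank {rank}/{world_size}: built a model and a dataset of "
          f"{len(dataset)} items, iterating over {len(local_indices)}")


if __name__ == "__main__":
    main()
```

In this sketch, n processes mean n calls to `build_model` and `build_dataset`. What happens to those per-process copies once the model is wrapped by FSDP is the part the question is about.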
