ParallelMapper missing a worker_init_fn #1494

@alanhdu

Description

🚀 The feature

Add worker_init_fn support for ParallelMapper (and maybe persistent_workers).

Motivation, pitch

Right now, there is no way to specify a custom worker_init_fn to customize worker startup for the parallel mapping. This means we cannot use ParallelMapper in process mode, since we often need to configure credentials, loggers, random seeds, etc. in each worker.
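
As a rough illustration of the kind of hook being requested, here is a minimal sketch that uses only the standard library's `concurrent.futures.ProcessPoolExecutor`, which accepts `initializer` and `initargs`. This is not torchdata's API: `derive_seed`, `init_worker`, `transform`, and `run` are hypothetical names, and keying the seed on the process ID is just one assumed way to make seeds differ between workers.

```python
import logging
import os
import random
from concurrent.futures import ProcessPoolExecutor


def derive_seed(base_seed: int, worker_key: int) -> int:
    """Combine a base seed with a per-worker key into a 32-bit seed."""
    return (base_seed + worker_key) % (2**32)


def init_worker(base_seed: int, log_level: int) -> None:
    """Runs once in each worker process before it handles any item."""
    # Per-process setup that a worker_init_fn would typically do:
    # configure logging and seed the RNG differently in each worker.
    logging.basicConfig(level=log_level)
    random.seed(derive_seed(base_seed, os.getpid()))
    logging.getLogger(__name__).info("worker %d initialized", os.getpid())


def transform(x: int) -> int:
    """Stand-in for the user's map function."""
    return x * x


def run(items, num_workers: int = 2, base_seed: int = 1234):
    """Map transform over items in worker processes initialized by init_worker."""
    with ProcessPoolExecutor(
        max_workers=num_workers,
        initializer=init_worker,
        initargs=(base_seed, logging.INFO),
    ) as pool:
        # pool.map yields results in input order.
        return list(pool.map(transform, items))


if __name__ == "__main__":
    print(run(range(5)))
```

An equivalent argument on ParallelMapper would let the same per-process setup run before the mapped function sees its first item.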

We could also consider adding a persistent_workers flag to avoid spinning up new processes on each epoch, which would avoid wasting time re-initializing the workers. If this is already implemented, the https://docs.pytorch.org/data/main/migrate_to_nodes_from_utils.html#map-style-datasets section should make that clear.

Alternatives

No response

Additional context

I think it'd be helpful for the docs to also talk a bit about the relationship with torch.utils.data.get_worker_info. I think that function is pretty commonly used for random seeding across workers, but it's not clear to me whether it works with nodes.
