ParallelMapper missing a `worker_init_fn`

### 🚀 The feature

Add `worker_init_fn` support for `ParallelMapper` (and maybe `persistent_workers`).

### Motivation, pitch

Right now, there is no way to specify a custom `worker_init_fn` for the parallel mapping to customize worker startup. This means we cannot use `ParallelMapper` with `process` mode, since we often need to configure credentials, loggers, random seeds, etc. in each worker process.
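For illustration, the only workaround I can think of today is to initialize lazily inside the mapped function itself. Below is a minimal sketch of that idea, not an existing torchdata feature: `InitOncePerProcess`, `init_worker`, and `transform` are hypothetical names, and it assumes the wrapped callable is picklable and gets passed as the mapper's `map_fn`. It is not thread-safe, so it only targets `process` mode.

```python
import os
from typing import Any, Callable, Optional


class InitOncePerProcess:
    """Wrap a map function so an init hook runs once in each process.

    Hypothetical helper: it emulates a `worker_init_fn` by checking the
    current PID on every call and running `init_fn` the first time the
    wrapper is invoked in a given process.
    """

    def __init__(self, map_fn: Callable[[Any], Any], init_fn: Callable[[], None]):
        self._map_fn = map_fn
        self._init_fn = init_fn
        # PID of the process where init_fn last ran; None means "never".
        self._initialized_pid: Optional[int] = None

    def __call__(self, item: Any) -> Any:
        pid = os.getpid()
        if self._initialized_pid != pid:
            # First call in this process (including after fork/spawn).
            self._init_fn()
            self._initialized_pid = pid
        return self._map_fn(item)


def init_worker() -> None:
    # Placeholder for credentials, logger, or seed setup (hypothetical).
    os.environ["WORKER_READY"] = "1"


def transform(x: int) -> int:
    # Placeholder per-item work (hypothetical).
    return x * 2


# Intended to be passed where a plain map function would go.
wrapped = InitOncePerProcess(transform, init_worker)
```

This pays for a PID check on every item and hides setup failures until the first item is processed, which is why a real `worker_init_fn` hook would be preferable.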

We could also consider adding a `persistent_workers` flag to avoid spinning up new processes on each epoch, which would avoid wasting time re-initializing the workers. If this is already implemented, the https://docs.pytorch.org/data/main/migrate_to_nodes_from_utils.html#map-style-datasets section should make that clear.

### Alternatives

_No response_

### Additional context

I think it'd be helpful for the docs to also talk a bit about the relationship with `torch.utils.data.get_worker_info`. That function is pretty commonly used for random seeding across workers, but it's not clear to me whether it works with nodes.

ParallelMapper missing a `worker_init_fn` #1494
