🚀 The feature
Add worker_init_fn support for ParallelMapper (and maybe persistent_workers).
Motivation, pitch
Right now, there is no way to specify a custom worker_init_fn for the parallel mapper workers to customize their startup. This means we cannot use ParallelMapper in process mode (since we often need to configure credentials, loggers, random seeds, etc. in each worker before mapping starts).
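For illustration, here's a rough sketch of what I have in mind. The worker_init_fn argument is the proposed (hypothetical) addition and does not exist today; the ParallelMapper/IterableWrapper names and signatures are my best guess at the current nodes API:

```python
# Hypothetical usage -- worker_init_fn is the requested feature and does not
# exist on ParallelMapper today; other names/signatures are assumptions.
import logging
import random

from torchdata.nodes import IterableWrapper, ParallelMapper


def my_worker_init_fn(worker_id: int) -> None:
    # Per-worker setup that currently has no hook in process mode:
    # credentials, loggers, random seeds, etc.
    logging.basicConfig(level=logging.INFO)
    random.seed(1234 + worker_id)


def double(x: int) -> int:
    return x * 2


node = ParallelMapper(
    IterableWrapper(range(100)),
    map_fn=double,
    num_workers=4,
    method="process",
    worker_init_fn=my_worker_init_fn,  # proposed hook (does not exist yet)
)
```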
We could also consider adding a persistent_workers flag to avoid spinning up new worker processes on each epoch (or, if this is already the behavior, making that clear in the https://docs.pytorch.org/data/main/migrate_to_nodes_from_utils.html#map-style-datasets section), which would avoid wasting time re-initializing workers.
Alternatives
No response
Additional context
I think it'd be helpful for the docs to also talk a bit about the relationship with torch.utils.data.get_worker_info -- I think that function is pretty commonly used for random seeding across workers, but it's not clear whether it works with nodes?
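For context, this is roughly the DataLoader pattern I mean (a minimal sketch; the numpy seeding is just an example), and it's not obvious what the equivalent would be for nodes workers:

```python
# Typical DataLoader pattern: worker_init_fn + get_worker_info for per-worker seeding.
import numpy as np
from torch.utils.data import get_worker_info


def seed_worker(worker_id: int) -> None:
    info = get_worker_info()  # WorkerInfo inside a DataLoader worker, None otherwise
    if info is not None:
        # Derive a per-worker numpy seed from the torch-provided base seed.
        np.random.seed(info.seed % 2**32)


# loader = DataLoader(dataset, num_workers=4, worker_init_fn=seed_worker)
```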