Type : Feature Request / Question
Description :
During training, data-loading workers may occasionally hang (e.g., because a data file is temporarily unavailable). A hung worker blocks the training pipeline until it recovers. To improve fault tolerance, I propose adding a timeout parameter: if a worker fails to return data within the specified time, the loader automatically hands its task to another worker.
Example Scenario :
Worker 1 starts loading data but hangs due to I/O issues.
With timeout=30s, once Worker 1 has gone 30 seconds without returning data, the system should terminate Worker 1 and reassign its task to Worker 2.
Suggested Implementation :
Add a timeout parameter to the DataLoader configuration.
Use a watchdog mechanism to monitor worker activity and reassign tasks on timeout.
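
To make the proposal concrete, here is a minimal sketch of the watchdog idea in plain Python. It is not tied to any particular DataLoader implementation: the names load_with_timeout, load_fn, and max_attempts are hypothetical and used only for illustration. It assumes each load task can simply be re-run from scratch in a fresh process, and it uses the "fork" start method to keep the example short, so it is POSIX-only as written.

```python
import multiprocessing as mp

# Assumption: "fork" avoids pickling the loader function, which keeps the
# sketch short. It is POSIX-only; real support would also need "spawn".
_ctx = mp.get_context("fork")


def _worker_main(load_fn, task, conn):
    """Worker entry point: load one task and report the outcome to the parent."""
    try:
        conn.send(("ok", load_fn(task)))
    except Exception as exc:  # ordinary failures are reported, not left hanging
        conn.send(("error", repr(exc)))
    finally:
        conn.close()


def load_with_timeout(load_fn, task, timeout, max_attempts=2):
    """Run `task` in a worker; on timeout, terminate it and retry in a new one.

    Returns (result, attempt_number). All names here are hypothetical.
    """
    for attempt in range(1, max_attempts + 1):
        recv_end, send_end = _ctx.Pipe(duplex=False)
        worker = _ctx.Process(
            target=_worker_main, args=(load_fn, task, send_end), daemon=True
        )
        worker.start()
        send_end.close()  # the parent keeps only the receiving end
        try:
            # Watchdog: wait at most `timeout` seconds for the worker to answer.
            if recv_end.poll(timeout):
                status, payload = recv_end.recv()
                if status == "ok":
                    return payload, attempt
                raise RuntimeError(f"worker failed: {payload}")
        except EOFError:
            pass  # worker died without reporting; retry with a new worker
        finally:
            if worker.is_alive():
                worker.terminate()  # kill the hung worker
            worker.join()
            recv_end.close()
    raise TimeoutError(
        f"task {task!r} did not finish within {timeout}s "
        f"after {max_attempts} attempts"
    )
```

In the scenario above, this would be called as load_with_timeout(load_fn, task, timeout=30). Starting one process per task is only for clarity; an actual implementation would more likely keep a persistent worker pool, have the watchdog track each worker's last activity, and requeue the pending task when a worker is replaced.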