Explicit iterator shutdown #1482
Conversation
This is awesome! We have been thinking of clean shutdown recently - check out this PR #1477
This seems weird and possibly related to the training framework. Curious: how did you verify that there were two running read threads?
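One way to verify this is to enumerate live threads right after each `iter()` call. Below is a minimal sketch using only the Python standard library; `dump_live_threads` and the `loader` name in the usage comment are placeholders, not torchdata APIs.

```python
import threading

def dump_live_threads(tag: str) -> None:
    """Print the names of all currently running threads (debugging helper)."""
    names = sorted(t.name for t in threading.enumerate())
    print(f"[{tag}] {len(names)} live threads: {names}")

# Hypothetical usage inside a training script, where `loader` is whatever
# node/dataloader the trainer re-iterates:
#   it = iter(loader); dump_live_threads("first iter")
#   it = iter(loader); dump_live_threads("second iter")
# Seeing the parallel node's read/worker thread listed twice after the second
# call indicates a surviving thread from the previous iterator.
```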
Kicking off CI - and I will also test this out! Overall this seems like a good change.
@divyanshk has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1482
Note: Links to docs will display an error until the docs builds have been completed.
❗ 1 Active SEV: there is 1 currently active SEV. If your PR is affected, please view it below.
❌ 1 Cancelled Job: as of commit 01511ff with merge base dbf04a9, the following job was cancelled. Please retry.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
surviving thread issue
Switching from … How I noticed this race condition: my pipeline expects groups of frames from the same video, and fails if frames are not consecutive. The debug job stopped at the exception before the older thread joined, and I noticed extra … I suspect a cyclic reference depriving …
lightning reiteration problem
I investigated the problem with double … There are some Lightning issues and PRs on double reiteration: Lightning-AI/pytorch-lightning#19427
It also might be useful to add a flag in …
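To narrow down whether the framework really calls `iter` twice in a short time, one option is to wrap the dataloader in a small counter that logs each `__iter__` call with a short stack trace. This is a hypothetical debugging wrapper, not part of torchdata or Lightning.

```python
import traceback

class IterCounter:
    """Wrap any iterable and log every __iter__ call (hypothetical debug helper)."""

    def __init__(self, inner):
        self._inner = inner
        self.calls = 0

    def __iter__(self):
        self.calls += 1
        print(f"__iter__ call #{self.calls}")
        traceback.print_stack(limit=5)  # shows which caller triggered the (re)iteration
        return iter(self._inner)

# Hypothetical usage: wrap the loader before handing it to the trainer, e.g.
#   train_loader = IterCounter(train_loader)
```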
This change looks good to me!
Bug
Currently, resetting ParallelMap kills threads with `del self._it`, which is not guaranteed to immediately call `self._it.__del__` and thus the `._shutdown` method. This leads to an extra living read thread. That extra thread can consume extra items from the `self.source` node, and those items will then be missing from the new `self._it`.
I faced this issue with `lightning.pytorch` training. I could not figure out whether Lightning just calls `iter` twice in a short time or somehow increases the ref count of the old iterator object, but as a matter of fact I had two running read threads, which broke my pipeline at the first batch.
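The underlying Python behavior: `del self._it` only removes one reference, and `__del__` (and whatever shutdown it triggers) runs only when the last reference disappears, which an extra holder, a traceback frame, or a reference cycle can delay. A minimal, self-contained illustration with a toy class (not the torchdata code):

```python
class ReadIterator:
    """Toy stand-in for the parallel iterator; __del__ plays the role of _shutdown()."""

    def __del__(self):
        print("shutdown: read thread joined")

it = ReadIterator()
extra_ref = it        # e.g. a frame in a traceback, or a reference held by the framework
del it                # prints nothing: the extra reference keeps the object alive
it = ReadIterator()   # the "new" iterator starts while the old one is still live
print("new iterator created, old one not shut down yet")
del extra_ref         # only now does __del__ (and thus the shutdown) run
```

With a reference cycle, the shutdown is delayed even further, until the cyclic garbage collector happens to run.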
Fix
Add an explicit `self._it._shutdown()` call to the `reset` method. This fixes the described bug.
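A minimal sketch of the shape of the fix, with simplified, hypothetical names (this is not the actual ParallelMap implementation): `reset` shuts the previous iterator down explicitly before creating a new one, instead of relying on `del` to trigger `__del__`.

```python
import queue
import threading

_SENTINEL = object()

class _ReadIterator:
    """Simplified per-epoch iterator backed by one read thread (hypothetical)."""

    def __init__(self, source):
        self._q = queue.Queue()  # unbounded for simplicity
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._read, args=(source,), daemon=True)
        self._thread.start()

    def _read(self, source):
        for item in source:
            if self._stop.is_set():
                return
            self._q.put(item)
        self._q.put(_SENTINEL)

    def __iter__(self):
        return self

    def __next__(self):
        item = self._q.get()
        if item is _SENTINEL:
            raise StopIteration
        return item

    def _shutdown(self):
        # Stop and join the read thread so it cannot keep consuming items
        # from the shared source after reset().
        self._stop.set()
        self._thread.join(timeout=1)

class ParallelMapLike:
    """Toy node illustrating the proposed reset() behavior."""

    def __init__(self, source):
        self._source = source
        self._it = None

    def reset(self):
        if self._it is not None:
            self._it._shutdown()  # explicit shutdown instead of a bare `del self._it`
        self._it = _ReadIterator(iter(self._source))
        return self._it
```

With this shape, calling `reset()` twice in a row leaves only one read thread alive, because the first iterator's thread is joined before the second one starts consuming from the shared source.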