
Explicit iterator shutdown #1482


Merged
merged 1 commit into pytorch:main on May 6, 2025

Conversation

@nvski (Contributor) commented May 2, 2025

Bug

Currently, resetting ParallelMap kills its threads with del self._it, which is not guaranteed to call self._it.__del__ (and therefore the ._shutdown method) immediately. This leaves an extra read thread alive, and that thread can keep consuming items from the self.source node, so those items are missing from the new self._it.

I faced this issue with lightning.pytorch training. I could not figure out whether lightning just calls iter twice in quick succession or somehow holds an extra reference to the old iterator object, but in practice I end up with two running read threads, which breaks my pipeline at the first batch.

Fix

Add an explicit self._it._shutdown() call to the reset method. This fixes the bug described above.
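For illustration, a minimal sketch of what the change amounts to inside the node's reset; the surrounding structure (base-class reset signature, how the new iterator is created) is assumed here, not copied from the actual source:

```python
# Sketch only: names other than self._it and _shutdown() are assumptions
# about the surrounding code, not the actual implementation.
def reset(self, initial_state=None):
    super().reset(initial_state)
    if getattr(self, "_it", None) is not None:
        # Explicitly stop the reader/worker threads instead of relying on
        # __del__ running as soon as the old iterator is dereferenced.
        self._it._shutdown()
        del self._it
    # Re-create the iterator over self.source (hypothetical helper name).
    self._it = self._make_iterator(initial_state)
```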

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 2, 2025
@divyanshk (Contributor) commented May 2, 2025

This is awesome! We have been thinking about clean shutdown recently; check out PR #1477.

but as a matter of fact I have two running read threads

This seems odd and likely related to the training framework. Curious: how did you verify there are two running read threads?

  1. Can you test your workflow without this change and with daemonic_reading=False? I'm curious to see if that solves the problem, thanks!

@divyanshk (Contributor) commented
Kicking off CI, and I will also test this out! Overall this seems like a good change.

@facebook-github-bot commented
@divyanshk has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pytorch-bot bot commented May 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1482

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 Cancelled Job

As of commit 01511ff with merge base dbf04a9:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@divyanshk divyanshk self-requested a review May 2, 2025 20:50
@nvski (Contributor, Author) commented May 3, 2025

@divyanshk

surviving thread issue

Switching from v0.11.0 to main (with #1477 already merged) and setting daemonic_reading=False does not help

How I noticed this race condition: my pipeline expects groups of frames from the same video and fails if the frames are not consecutive. The debug job stopped at the exception before the older thread had joined, and I noticed an extra _populate_queue thread iterating through the same self.source concurrently.
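(For what it's worth, a quick way to look for such a leftover thread from the Python side is simply to enumerate live threads after a reset; whether the extra reader is identifiable by name depends on how the library names its worker threads, so the printout is only a heuristic.)

```python
import threading

# List live threads after resetting the pipeline; a reader/worker thread
# that should already have been shut down will still show up here.
for t in threading.enumerate():
    print(t.name, "daemon" if t.daemon else "non-daemon", t.is_alive())
```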

I suspect a cyclic reference is preventing self._it from being collected immediately, but I could not find one in my project. I might have missed a cycle myself, or have an unfinished generator frame somewhere, or the debug session is interfering, or something mysterious is happening on the lightning side. Unfortunately, I could not reproduce this behavior as a realistic unit test.
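A minimal illustration of why del self._it alone is not enough when something else still references the old iterator (the names below are made up for the example):

```python
class Reader:
    def __del__(self):
        # Stands in for the _shutdown() call that stops the read thread.
        print("shutdown")

it = Reader()
extra_ref = it   # e.g. a framework, debugger frame, or cycle holding the iterator
del it           # prints nothing: the object is still alive via extra_ref
del extra_ref    # only now is __del__ (and thus the shutdown) triggered
```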

lightning reiteration problem

I investigated the problem with the double iter call, and yes, lightning calls iter twice on my tn.Loader instance in _FitLoop.setup_data: L252 and L275. I tried using a tn.BaseNode instance as a dataloader directly and calling .reset() manually on epoch restart, but that just postponed the extra surviving thread issue until the start of the second epoch.

There are some lightning issues and PRs on double reiteration:

Lightning-AI/pytorch-lightning#19427
Lightning-AI/pytorch-lightning#20775
Lightning-AI/pytorch-lightning#20705

It might also be useful to add a flag to tn.Loader that skips self._it.reset() when self._it.__next__ has not been called since the previous reset? Something like this draft. It would help avoid skipping epochs, allow instant restarts after loading state from a checkpoint, etc.
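A rough, hypothetical sketch of that idea (this is not the linked draft; the flag name, the wrapper class, and the Loader shown here are simplified stand-ins for tn.Loader):

```python
class _TrackingIter:
    """Wraps a node and records whether anything was consumed since reset."""

    def __init__(self, node):
        self._node = node
        self.consumed = False

    def __iter__(self):
        return self

    def __next__(self):
        self.consumed = True
        return next(self._node)


class Loader:
    """Simplified stand-in for tn.Loader with a hypothetical skip-reset flag."""

    def __init__(self, node, restart_on_stale_only=False):
        self._node = node
        self._restart_on_stale_only = restart_on_stale_only
        self._it = None

    def __iter__(self):
        stale = self._it is not None and not self._it.consumed
        if self._it is None or not (self._restart_on_stale_only and stale):
            # Reset (shutting down the old reader threads) unless the flag is
            # set and the previous iterator never produced an item.
            self._node.reset()
            self._it = _TrackingIter(self._node)
        return self._it
```

With restart_on_stale_only=True, a second iter() call before any batch is consumed (as lightning's setup_data does) would reuse the fresh iterator instead of tearing it down and starting over.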

@divyanshk (Contributor) left a comment

This change looks good to me!

@divyanshk divyanshk mentioned this pull request May 6, 2025
@divyanshk divyanshk merged commit 10b87d6 into pytorch:main May 6, 2025
62 of 64 checks passed