
Explicit iterator shutdown #1482


Merged
merged 1 commit into pytorch:main on May 6, 2025

Conversation

@nvski (Contributor) commented May 2, 2025

Bug

Currently, resetting ParallelMap kills its threads with del self._it, which is not guaranteed to call self._it.__del__ (and therefore the ._shutdown method) immediately. This leaves an extra read thread alive, and that thread can keep consuming items from the self.source node, so those items are missing from the new self._it.

I faced this issue with lightning.pytorch training. I could not figure out whether lightning just calls iter twice in quick succession or somehow holds an extra reference to the old iterator object, but in practice I end up with two running read threads, which breaks my pipeline at the first batch.

Fix

Add an explicit self._it._shutdown() call to the reset method. This fixes the bug described above.
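For illustration, a minimal sketch of what the change amounts to inside the node's reset; the surrounding structure (base-class reset signature, how the new iterator is created) is assumed here, not copied from the actual source:

```python
# Sketch only: names other than self._it and _shutdown() are assumptions
# about the surrounding code, not the actual implementation.
def reset(self, initial_state=None):
    super().reset(initial_state)
    if getattr(self, "_it", None) is not None:
        # Explicitly stop the reader/worker threads instead of relying on
        # __del__ running as soon as the old iterator is dereferenced.
        self._it._shutdown()
        del self._it
    # Re-create the iterator over self.source (hypothetical helper name).
    self._it = self._make_iterator(initial_state)
```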

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on May 2, 2025
@divyanshk (Contributor) commented May 2, 2025

This is awesome! We have been thinking about clean shutdown recently; check out PR #1477.

but as a matter of fact I have two running read threads

This seems odd and likely related to the training framework. Curious: how did you verify there are two running read threads?

  1. Can you test your workflow without this change and with daemonic_reading=False? I'm curious to see if that solves the problem, thanks!

@divyanshk (Contributor) commented
Kicking off CI, and I will also test this out! Overall this seems like a good change.

@facebook-github-bot commented
@divyanshk has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

pytorch-bot bot commented May 2, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1482

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 1 Cancelled Job

As of commit 01511ff with merge base dbf04a9:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@divyanshk divyanshk self-requested a review May 2, 2025 20:50
@nvski (Contributor, Author) commented May 3, 2025

@divyanshk

surviving thread issue

Switching from v0.11.0 to main (with #1477 already merged) and setting daemonic_reading=False does not help

How I noticed this race condition: my pipeline expects groups of frames from the same video and fails if the frames are not consecutive. The debug job stopped at the exception before the older thread had joined, and I noticed an extra _populate_queue thread iterating through the same self.source concurrently.
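(For what it's worth, a quick way to look for such a leftover thread from the Python side is simply to enumerate live threads after a reset; whether the extra reader is identifiable by name depends on how the library names its worker threads, so the printout is only a heuristic.)

```python
import threading

# List live threads after resetting the pipeline; a reader/worker thread
# that should already have been shut down will still show up here.
for t in threading.enumerate():
    print(t.name, "daemon" if t.daemon else "non-daemon", t.is_alive())
```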

I suspect a cyclic reference is preventing self._it from being collected immediately, but I could not find one in my project. I might have missed a cycle myself, or have an unfinished generator frame somewhere, or the debug session is interfering, or something mysterious is happening on the lightning side. Unfortunately, I could not reproduce this behavior as a realistic unit test.
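A minimal illustration of why del self._it alone is not enough when something else still references the old iterator (the names below are made up for the example):

```python
class Reader:
    def __del__(self):
        # Stands in for the _shutdown() call that stops the read thread.
        print("shutdown")

it = Reader()
extra_ref = it   # e.g. a framework, debugger frame, or cycle holding the iterator
del it           # prints nothing: the object is still alive via extra_ref
del extra_ref    # only now is __del__ (and thus the shutdown) triggered
```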

lightning reiteration problem

I investigated the problem with the double iter call, and yes, lightning calls iter twice on my tn.Loader instance in _FitLoop.setup_data: L252 and L275. I tried using a tn.BaseNode instance as a dataloader directly and calling .reset() manually on epoch restart, but that just postponed the extra surviving thread issue until the start of the second epoch.

There are some lightning issues and PRs on double reiteration:

Lightning-AI/pytorch-lightning#19427
Lightning-AI/pytorch-lightning#20775
Lightning-AI/pytorch-lightning#20705

It might also be useful to add a flag to tn.Loader that skips self._it.reset() when self._it.__next__ has not been called since the previous reset? Something like this draft. It would help avoid skipping epochs, allow instant restarts after loading state from a checkpoint, etc.
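A rough, hypothetical sketch of that idea (this is not the linked draft; the flag name, the wrapper class, and the Loader shown here are simplified stand-ins for tn.Loader):

```python
class _TrackingIter:
    """Wraps a node and records whether anything was consumed since reset."""

    def __init__(self, node):
        self._node = node
        self.consumed = False

    def __iter__(self):
        return self

    def __next__(self):
        self.consumed = True
        return next(self._node)


class Loader:
    """Simplified stand-in for tn.Loader with a hypothetical skip-reset flag."""

    def __init__(self, node, restart_on_stale_only=False):
        self._node = node
        self._restart_on_stale_only = restart_on_stale_only
        self._it = None

    def __iter__(self):
        stale = self._it is not None and not self._it.consumed
        if self._it is None or not (self._restart_on_stale_only and stale):
            # Reset (shutting down the old reader threads) unless the flag is
            # set and the previous iterator never produced an item.
            self._node.reset()
            self._it = _TrackingIter(self._node)
        return self._it
```

With restart_on_stale_only=True, a second iter() call before any batch is consumed (as lightning's setup_data does) would reuse the fresh iterator instead of tearing it down and starting over.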

@divyanshk (Contributor) left a comment

This change looks good to me!

@divyanshk divyanshk mentioned this pull request May 6, 2025
@divyanshk divyanshk merged commit 10b87d6 into pytorch:main May 6, 2025
62 of 64 checks passed