
Adding Cycler, Header #1461

Merged: 8 commits into pytorch:main on Apr 29, 2025

Conversation

keunwoochoi (Contributor):

Please read through our contribution guide prior to creating your pull request.

  • If you are adding a new node, ensure you read that section in the contribution guide, as it includes requirements for
    functionality and testing.

Following up on discussion #1452 and my previous PR #1454.

Changes

  • Adds Cycler, Header, Shuffler and their tests.

@facebook-github-bot added the CLA Signed label on Mar 7, 2025. (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
@divyanshk (Contributor):

Thanks for the PR, reviewing....

Also kicking off the CI.

self._num_cycles += 1
self.source.reset(None)

# Try again - if it's empty, this will raise StopIteration
Contributor:

At this point, the source shouldn't be empty after the reset. If it were empty, it would raise at line 64.

If this makes sense, we can update the comment.

Contributor Author:

It is an edge case where the source node is blank. I think that's a fair consideration (e.g., the current Iterable node can also take a blank list without any error during instantiation).
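For context, a minimal method sketch of the next() path under discussion, assuming the Cycler fields shown in the diff (self.source, self._num_cycles, self._has_started); the second next() call is where a blank source surfaces as StopIteration:

def next(self):
    try:
        item = next(self.source)
    except StopIteration:
        # End of one pass over the source: count the cycle and rewind.
        self._num_cycles += 1
        self.source.reset(None)
        # Try again - if the source is blank even after the reset,
        # this raises StopIteration and iteration ends cleanly.
        item = next(self.source)
    self._has_started = True
    return item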

"""Get the current state of the node.

Returns:
    A dictionary containing the state of the source node and number of cycles completed.
@divyanshk (Contributor), Mar 12, 2025:

Nit: Can we make it "Dict[str, Any] - A dictionary containing the state of the source node and number of cycles completed."? Ditto in other nodes in the PR, thanks.
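For concreteness, a small sketch of the requested docstring format (the wording is taken from the diff above; the method body is omitted):

from typing import Any, Dict

def get_state(self) -> Dict[str, Any]:
    """Get the current state of the node.

    Returns:
        Dict[str, Any] - A dictionary containing the state of the source node
        and number of cycles completed.
    """
    ...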

Comment on lines 39 to 40
self._num_cycles = initial_state.get(self.NUM_CYCLES_KEY, 0)
self._has_started = initial_state.get(self.HAS_STARTED_KEY, False)
Contributor:

I wonder if we should not have default values here. If the state is set up wrongly, this can lead to unexpected behavior.

Contributor Author:

Agreed, working on it in the next commit.
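A minimal sketch of the stricter variant being agreed on here, assuming a reset(initial_state) hook like the one used elsewhere in the diff; direct indexing makes a malformed state fail loudly instead of silently falling back to defaults:

def reset(self, initial_state=None):
    if initial_state is not None:
        # No .get(...) defaults: a missing key means the saved state is
        # malformed, and a KeyError is preferable to silently starting over.
        self._num_cycles = initial_state[self.NUM_CYCLES_KEY]
        self._has_started = initial_state[self.HAS_STARTED_KEY]
    else:
        self._num_cycles = 0
        self._has_started = False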

pytorch-bot bot commented Mar 12, 2025:

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1461

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit f5c3b4a with merge base d349d80:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

self.RNG_STATE_KEY: self.rng.getstate(),
self.BUFFER_KEY: list(self.buffer),
self.NUM_SHUFFLED_KEY: self._num_shuffled,
self.RANDOM_STATE_KEY: self.rng.getstate(),
Contributor:

RANDOM_STATE_KEY is a duplicate of RNG_STATE_KEY; let's remove RANDOM_STATE_KEY.

Comment on lines 69 to 71
while len(self.buffer) < self.buffer_size:
    self.buffer.append(next(self.source))
return True
Contributor:

This returns True even when we do not enter the while loop. Is that expected?

Do we want _fill_buffer() to return True if there are elements in the buffer, or only if the call to the function led to elements being added?

Contributor Author:

Good catch! I've updated the method to explicitly handle the case where the buffer is already full.

The return value now clearly indicates whether the buffer has items after the call (True) or is empty (False). This matches how it's used in next(), where we only raise StopIteration if the buffer is empty and _fill_buffer() returns False.
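A sketch of what the updated _fill_buffer() might look like under that contract (True iff the buffer is non-empty after the call); the exact implementation in the commit may differ:

def _fill_buffer(self) -> bool:
    """Top up the buffer from the source and report whether it holds any items."""
    if len(self.buffer) >= self.buffer_size:
        # Buffer is already full; nothing to pull from the source.
        return True
    try:
        while len(self.buffer) < self.buffer_size:
            self.buffer.append(next(self.source))
    except StopIteration:
        # Source exhausted; whatever is already buffered can still be yielded.
        pass
    return len(self.buffer) > 0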

RNG_STATE_KEY = "rng_state"
BUFFER_KEY = "buffer"
NUM_SHUFFLED_KEY = "num_shuffled"
RANDOM_STATE_KEY = "random_state"
Contributor:

Let's include num yielded in the state as well.

return {
    self.SOURCE_KEY: self.source.state_dict(),
    self.RNG_STATE_KEY: self.rng.getstate(),
    self.BUFFER_KEY: list(self.buffer),
Contributor:

Should we avoid keeping the entire buffer as state? Since this would be a list of node objects, it might not make sense.

Contributor Author:

Makes sense. Updated in the next commit.
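Taken together, the comments above (dropping the duplicate RANDOM_STATE_KEY, adding a yield counter, and no longer storing the buffer) point toward a state dict roughly like the following; NUM_YIELDED_KEY and self._num_yielded are illustrative names, not necessarily what the commit uses:

from typing import Any, Dict

def get_state(self) -> Dict[str, Any]:
    return {
        self.SOURCE_KEY: self.source.state_dict(),
        self.RNG_STATE_KEY: self.rng.getstate(),
        self.NUM_YIELDED_KEY: self._num_yielded,
    }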


Args:
    source_node (BaseNode[T]): The source node to pull items from.
    buffer_size (int): Size of the buffer used for shuffling. Must be at least 1.
@divyanshk (Contributor), Mar 12, 2025:

Thinking from a user POV. Would they have a strong opinion on what the buffer_size argument should be? Maybe not.

I fear this would end up being set as an arbitrarily large number just to maximize shuffling capacity.

Should we see if we can make it work without buffer_size and seed as an argument? We have something like this in MultiNodeWeightedSampler (link below)

Update: updated link: https://github.com/pytorch/data/blob/main/torchdata/nodes/samplers/multi_node_weighted_sampler.py#L223

Contributor Author:

From my experience, I believe we should definitely let the user control this. The ideal shuffle buffer size depends a lot on the data source in my use cases.
For example, if a data source is not pre-shuffled globally and is only sharded (which is not ideal, but would be a good use case for Shuffler), the buffer size should be as large as the number of items in each shard.

If the buffer size is too large, yes, a lot of other issues can occur. But perhaps that's really up to the users, and they should understand how Shuffler works? Especially since I'm not sure there's a non-trivial method to set a value that works for the majority of scenarios I can imagine.

Contributor Author:

I'm not able to open the link. Can you share it from the public code repo?

Contributor Author:

Thanks for the update. I still think the buffer size is much more (indeed, 100%) up to the user, unlike the example where some nice parameter can help things work faster without affecting any of the core features.

Contributor:

Let's keep a constant as a default? I don't want most users thinking too much about this when using the Shuffler.

Contributor Author:

@divyanshk what would be a good constant? Being someone who is not sure about having a default value, I'm probably not the best person to decide it.

Contributor:

Let's do 1000.
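As a sketch of the agreed direction (only source_node, buffer_size, and seed come from the diff; DEFAULT_BUFFER_SIZE is an illustrative name, and the BaseNode import path is assumed from torchdata.nodes):

from typing import Optional, TypeVar

from torchdata.nodes import BaseNode

T = TypeVar("T")

DEFAULT_BUFFER_SIZE = 1000  # default constant agreed on in this thread


class Shuffler(BaseNode[T]):
    def __init__(
        self,
        source_node: BaseNode[T],
        buffer_size: int = DEFAULT_BUFFER_SIZE,
        seed: Optional[int] = None,
    ) -> None:
        ...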

@keunwoochoi (Contributor Author):

Hi @divyanshk, thanks for the careful review. I updated the code per the review comments. Most of the requested changes are done; I think the one exception is whether we should allow something like n=0, which IMO we should.

@divyanshk (Contributor):

@keunwoochoi Thank you for the updates, will get back soon!

@keunwoochoi (Contributor Author) commented Mar 31, 2025:

@divyanshk gently reminding you of this :)

Comment on lines 84 to 88
@parameterized.expand(itertools.product([0, 3, 7]))
def test_save_load_state(self, midpoint: int) -> None:
    # This test is now expected to fail since we don't save the buffer
    # in the state, which changes the behavior after loading state
    pass
Contributor:

Probably not a good idea to include the test if it is meant to fail.

Regarding restoring state: since we store RNG_STATE_KEY in the state, we should be able to restore where we left off, right?

We should restore state, or else it might make any pipeline that has a Shuffler hard to use.

Contributor Author:

Oops. And yes, 100% agree.

@divyanshk (Contributor) commented Mar 31, 2025:

@keunwoochoi Thanks for the reminder, and for waiting.

This is looking good. I left a minor comment on adding a default value for the shuffler buffer size and on using the state utility function in tests, and a minor-ish comment on restoring the shuffler's state using the rng state. Thanks.

@keunwoochoi (Contributor Author):

@divyanshk thanks for the review again!

Mostly done, if not all, except for adding a default value for the shuffle buffer. But I'm sitting on a flight that is about to depart; I may want to have another look, and/or you can have a look too if you have some time ;)

@keunwoochoi (Contributor Author):

Reviewed it again. Looks good to me; asking for what is hopefully the last review :)

# Save state and create a new node
state = node.state_dict()
new_source = StatefulRangeNode(n=n)
new_node = Shuffler(new_source, buffer_size=5, seed=42)
Contributor:

The state restoration would break if someone restarts with a different buffer_size, right?

It's common sense that if someone changes their data pipeline after stopping, they are bound to get different results. On the other hand, part of me thinks we should capture buffer_size as state and throw an error. Not required for this PR, maybe something for later on. LMK your thoughts?

Contributor:

I agree.
We can add a check for whether the shuffle params are the same when we try to reload.

Contributor Author:

Yeah, I'm also 50:50. In some sense, I might wish I could just resume with a different shuffle buffer size. But perhaps that wish should be realized by another, more general solution, e.g., by adding some method to ignore the data loader states.

Adding BUFFER_SIZE_KEY in the new commit.
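For reference, the kind of guard being discussed could look roughly like this (a sketch only; the key name follows the BUFFER_SIZE_KEY mentioned above, and where exactly the check lives is up to the implementation):

def reset(self, initial_state=None):
    if initial_state is not None:
        saved_buffer_size = initial_state[self.BUFFER_SIZE_KEY]
        if saved_buffer_size != self.buffer_size:
            raise ValueError(
                f"Shuffler state was saved with buffer_size={saved_buffer_size}, "
                f"but this Shuffler was created with buffer_size={self.buffer_size}"
            )
    # ... rest of the state restoration ...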

Comment on lines 121 to 124
# The combined sequence will have fewer items than expected because
# we don't preserve the buffer in the state. We expect to lose
# approximately buffer_size items.
buffer_size = 5
@divyanshk (Contributor), Apr 9, 2025:

I am missing something. Can you explain this?

We have to ensure there is no drop in items on resumption.

Contributor:

Do we need a "fast forward" mode on loading state to get us to return the right elements back?

Contributor:

Agreed.
The intended behavior should be no difference between yielding all 20 items in one run and yielding them across a run with interruptions.
A simple but inefficient solution would be to reset the source and fast-forward past the already-yielded items.
Another one (which you might have already explored) would be to store the buffer as it is.

These solutions both have their drawbacks.

  1. If we reset the source and fast-forward, we pay more cost when we restart, but we only pay that cost if we restart.
  2. On the other hand, if we save the buffer, we pay the cost every time we call .get_state() (some people might do it every step).

Maybe we give both options to the user (might create confusion) and let them choose.
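To make option 1 concrete, a rough sketch of reset-and-fast-forward (assuming the Shuffler keeps a yield counter in its state under an illustrative NUM_YIELDED_KEY, and that self.seed is the seed it was constructed with):

import random

def reset(self, initial_state=None):
    # Rewind the source and re-seed the RNG so a replay reproduces
    # exactly the same shuffle order as the original run.
    self.source.reset(None)
    self.rng = random.Random(self.seed)
    self.buffer.clear()
    self._num_yielded = 0
    if initial_state is not None:
        # Fast-forward: draw and discard the items that were yielded before
        # the checkpoint. This cost is paid only on restart, never in get_state().
        for _ in range(initial_state[self.NUM_YIELDED_KEY]):
            next(self)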

Contributor Author:

"The intended behavior should be no difference between yielding all 20 items in one run and yielding them across a run with interruptions."
+1

But overall, this is a tricky problem. I can totally imagine users appreciating a non-stateful shuffler for efficiency. I'm actually down to excluding Shuffler from this PR and finishing the other nodes first while we discuss this. What do you think?

@ramanishsingh (Contributor), Apr 17, 2025:

"I'm actually down to excluding Shuffler from this PR and finishing the other nodes first while we discuss this. What do you think?"

Yes, that will be easier and cleaner. Thanks for doing that!

@keunwoochoi changed the title from "Adding Cycler, Header, Shuffler" to "Adding Cycler, Header" on Apr 17, 2025.
@keunwoochoi (Contributor Author):

Just removed Shuffler from this PR, partly because I'm getting busier these days and I would still like to get this PR merged sooner rather than later.

@divyanshk (Contributor) left a comment:

Looks good to me! Thank you for this PR!

@keunwoochoi (Contributor Author):

Nice! I'm not familiar with the further CI tests, but please let me know if there's anything I can do.

@ramanishsingh (Contributor):

Some CI tests are failing due to an issue that is fixed in #1477. Your PR is fine. :)

@ramanishsingh merged commit cda7d1a into pytorch:main on Apr 29, 2025 (60 of 62 checks passed).
Labels: CLA Signed
4 participants