Adding Cycler, Header, Shuffler #1461

Open
wants to merge 1 commit into
base: main

keunwoochoi
Contributor

Please read through our contribution guide prior to
creating your pull request.

  • If you are adding a new node, ensure you read that section in the contribution guide, as it includes requirements for
    functionality and testing.

Following up on discussion #1452 and my previous PR #1454.

Changes

  • Adds Cycler, Header, Shuffler and their tests.

@facebook-github-bot added the CLA Signed label Mar 7, 2025
@divyanshk
Contributor

Thanks for the PR, reviewing....

Also kicking off the CI.

self._num_cycles += 1
self.source.reset(None)

# Try again - if it's empty, this will raise StopIteration
Contributor

At this point, the source shouldn't be empty after the reset. If it were empty, it would have raised at line 64.

If this makes sense, we can update the comment.
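
For reference, a minimal sketch of the restart-on-exhaustion pattern under discussion. `CyclerSketch`, `make_source`, and `_yielded_this_cycle` are hypothetical names, and the class uses plain Python iterators rather than the PR's `BaseNode`; it only illustrates why, once an empty source is rejected up front, the retry after a reset should not come back empty (assuming the source yields the same items on every pass).

```python
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


class CyclerSketch:
    """Hypothetical stand-in for a cycling node; not the PR's Cycler."""

    def __init__(self, make_source: Callable[[], Iterator[T]]):
        self._make_source = make_source  # factory that restarts the source
        self._source = make_source()
        self._yielded_this_cycle = False
        self.num_cycles = 0

    def __iter__(self) -> "CyclerSketch":
        return self

    def __next__(self) -> T:
        try:
            item = next(self._source)
        except StopIteration:
            if not self._yielded_this_cycle:
                # The source produced nothing at all: treat it as empty and stop.
                raise
            # The source was exhausted after yielding items: count the cycle
            # and restart. The retry yields as long as the source is repeatable.
            self.num_cycles += 1
            self._source = self._make_source()
            self._yielded_this_cycle = False
            item = next(self._source)
        self._yielded_this_cycle = True
        return item
```

With this shape, the comment on the retry could say that the source is known to be non-empty rather than that it may raise.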

"""Get the current state of the node.

Returns:
A dictionary containing the state of the source node and number of cycles completed.
Contributor

@divyanshk Mar 12, 2025

Nit: Can we make it "Dict[str, Any] - A dictionary containing the state of the source node and number of cycles completed."? Ditto in other nodes in the PR, thanks.

Comment on lines +39 to +40
self._num_cycles = initial_state.get(self.NUM_CYCLES_KEY, 0)
self._has_started = initial_state.get(self.HAS_STARTED_KEY, False)
Contributor

I wonder if we should not have default values here. If the state is set up incorrectly, this can lead to unexpected behavior.
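
To make the suggestion concrete, here is a sketch of strict state loading that fails loudly instead of falling back to defaults. `load_cycler_state` is a hypothetical helper, not part of the PR; only the two key names are taken from the diff.

```python
from typing import Any, Dict

NUM_CYCLES_KEY = "num_cycles"
HAS_STARTED_KEY = "has_started"


def load_cycler_state(initial_state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: read the required keys without silent defaults."""
    required = (NUM_CYCLES_KEY, HAS_STARTED_KEY)
    missing = [key for key in required if key not in initial_state]
    if missing:
        # A malformed checkpoint is reported here rather than being
        # papered over with 0 / False.
        raise KeyError(f"state is missing required keys: {missing}")
    return {key: initial_state[key] for key in required}
```

The trade-off is that checkpoints written before a key existed would no longer load, so this only fits if backward compatibility of saved state is not a concern.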


pytorch-bot bot commented Mar 12, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1461

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure

As of commit fbbc425 with merge base 327e225:

NEW FAILURE - The following job has failed:

  • Lint / mypy (gh)
    torchdata/nodes/shuffler.py:54:31: error: Argument 1 to "setstate" of "Random"

This comment was automatically generated by Dr. CI and updates every 15 minutes.


SOURCE_KEY = "source"
NUM_CYCLES_KEY = "num_cycles"
HAS_STARTED_KEY = "has_started"
Contributor

Let's also include "num_yielded".


def __init__(self, source_node: BaseNode[T], n: int):
super().__init__()
if n < 0:
Contributor

Should we throw a ValueError when n = 0? That is like setting up a node to not do anything, which seems like wrong user input.
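
If the stricter check is adopted, it could look like the sketch below. `header` here is a hypothetical plain-iterator function built on `itertools.islice`, not the PR's `Header` node, and rejecting `n = 0` is the proposal being discussed, not settled behavior.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def header(source: Iterable[T], n: int) -> Iterator[T]:
    """Hypothetical sketch: yield at most the first n items of source."""
    if n <= 0:
        # Reject zero as well as negatives: a header of length 0 yields nothing.
        raise ValueError(f"n must be a positive integer, got {n}")
    return islice(source, n)
```

Because `header` is a regular function rather than a generator, the `ValueError` is raised at construction time, before any item is pulled.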

self.RNG_STATE_KEY: self.rng.getstate(),
self.BUFFER_KEY: list(self.buffer),
self.NUM_SHUFFLED_KEY: self._num_shuffled,
self.RANDOM_STATE_KEY: self.rng.getstate(),
Contributor

RANDOM_STATE_KEY is a duplicate of RNG_STATE_KEY. Let's remove RANDOM_STATE_KEY.

Comment on lines +69 to +71
while len(self.buffer) < self.buffer_size:
self.buffer.append(next(self.source))
return True
Contributor

This returns True even when we do not enter the while loop. Is that expected?

Do we want _fill_buffer() to return True if there are elements in the buffer, or only if the call led to elements being added?
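
One way to pin down the second interpretation is sketched below. `fill_buffer` is a hypothetical free function over plain iterators, not the PR's `_fill_buffer()`; it returns True only when the call itself appended something, and it stops quietly when the source runs out.

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def fill_buffer(buffer: List[T], source: Iterator[T], buffer_size: int) -> bool:
    """Hypothetical sketch: return True only if this call appended an item."""
    added = False
    while len(buffer) < buffer_size:
        try:
            buffer.append(next(source))
        except StopIteration:
            # Source exhausted: leave the buffer as it is.
            break
        added = True
    return added
```

Under the first interpretation, the return value would simply be `len(buffer) > 0`, which callers can already check themselves.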

RNG_STATE_KEY = "rng_state"
BUFFER_KEY = "buffer"
NUM_SHUFFLED_KEY = "num_shuffled"
RANDOM_STATE_KEY = "random_state"
Contributor

Let's include num_yielded in the state as well.

return {
self.SOURCE_KEY: self.source.state_dict(),
self.RNG_STATE_KEY: self.rng.getstate(),
self.BUFFER_KEY: list(self.buffer),
Contributor

Should we avoid keeping the entire buffer as state? Since this would be a list of node objects, it might not make sense.
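
To help weigh this, here is a sketch of what the buffer contributes to the checkpoint. `ShuffleBufferSketch` is a hypothetical stand-in that reads from a list and tracks a position; it is not the PR's `Shuffler`. In this sketch the buffered items have already been consumed from the source, so a state without the buffer could not reproduce them on resume.

```python
import random
from typing import Any, Dict, List


class ShuffleBufferSketch:
    """Hypothetical buffered shuffler over a list; not the PR's Shuffler."""

    def __init__(self, items: List[int], buffer_size: int, seed: int = 0):
        if buffer_size < 1:
            raise ValueError("buffer_size must be at least 1")
        self.items = items
        self.buffer_size = buffer_size
        self.rng = random.Random(seed)
        self.buffer: List[int] = []
        self.pos = 0  # index of the next unread item in `items`

    def __iter__(self) -> "ShuffleBufferSketch":
        return self

    def __next__(self) -> int:
        # Top up the buffer while the source still has items.
        while len(self.buffer) < self.buffer_size and self.pos < len(self.items):
            self.buffer.append(self.items[self.pos])
            self.pos += 1
        if not self.buffer:
            raise StopIteration
        # Swap a random element to the end, then pop it.
        idx = self.rng.randrange(len(self.buffer))
        self.buffer[idx], self.buffer[-1] = self.buffer[-1], self.buffer[idx]
        return self.buffer.pop()

    def state_dict(self) -> Dict[str, Any]:
        return {
            "pos": self.pos,
            "rng_state": self.rng.getstate(),
            "buffer": list(self.buffer),  # copy, so later pops do not alter it
        }

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        self.pos = state["pos"]
        self.rng.setstate(state["rng_state"])
        self.buffer = list(state["buffer"])
```

The cost is that checkpoint size grows with `buffer_size`, which is the concern raised above.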


Args:
source_node (BaseNode[T]): The source node to pull items from.
buffer_size (int): Size of the buffer used for shuffling. Must be at least 1.
Contributor

@divyanshk Mar 12, 2025

Thinking from a user POV. Would they have a strong opinion on what the buffer_size argument should be? Maybe not.

I fear this would end up being set as an arbitrarily large number just to maximize shuffling capacity.

Should we see if we can make it work without buffer_size and seed as arguments? We have something like this in MultiNodeWeightedSampler (link below).

Update: updated link: https://github.com/pytorch/data/blob/main/torchdata/nodes/samplers/multi_node_weighted_sampler.py#L223

Contributor Author

From my experience, I believe we should definitely let users control this. The ideal shuffle buffer size varies a lot by data source in my use cases.
For example, if a data source is only sharded and not pre-shuffled globally (which sucks, but it would be a good use case for Shuffler), the buffer size should be as large as the number of items in each shard.

If the buffer size is too large, yes, a lot of other issues can occur. But perhaps that's really up to users, and they should understand how Shuffler works? Especially since I'm not sure there is a non-trivial method to set a value that works for the majority of scenarios I can imagine.
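
A small sketch of why the buffer size matters for data that is not pre-shuffled. `buffered_shuffle` and `max_early_displacement` are hypothetical helpers written for this illustration, not the PR's implementation. With a buffer of size k, an item at source index i cannot be emitted before output position i - k + 1, so items far apart in the source cannot be mixed unless the buffer is correspondingly large.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(source: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Hypothetical sketch of a streaming shuffle with a bounded buffer."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in source:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            # Emit one random element once the buffer is full.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # Source exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    yield from buffer


def max_early_displacement(n: int, buffer_size: int, seed: int = 0) -> int:
    """Largest number of positions any item moved toward the front."""
    out = list(buffered_shuffle(range(n), buffer_size, seed))
    return max(index - position for position, index in enumerate(out))
```

For any seed, `max_early_displacement(n, k)` is at most `k - 1`, which is one way to read the point about matching the buffer size to the shard size.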

Contributor Author

I'm not able to open the link. Can you share it from the public code repo?

3 participants