Adding Cycler, Header, Shuffler #1461
Thanks for the PR, reviewing... Also kicking off the CI.
self._num_cycles += 1
self.source.reset(None)

# Try again - if it's empty, this will raise StopIteration
At this point, the source shouldn't be empty after the reset. If it were empty, it would already have raised at line 64.
If this makes sense, we can update the comment.
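To make the point above concrete, here is a minimal sketch of the cycling logic using a plain list instead of a real node. `TinyCycler` and its attributes are hypothetical stand-ins, not the PR's implementation; it assumes an empty source is detected on the very first pull, so the retry after a reset can only happen for a source that has already produced items.

```python
from typing import Generic, Iterable, List, TypeVar

T = TypeVar("T")


class TinyCycler(Generic[T]):
    """Hypothetical stand-in for the Cycler node, backed by a plain list."""

    def __init__(self, items: Iterable[T]) -> None:
        self._items: List[T] = list(items)
        self._it = iter(self._items)
        self.num_cycles = 0
        self._has_started = False

    def __iter__(self) -> "TinyCycler[T]":
        return self

    def __next__(self) -> T:
        try:
            item = next(self._it)
        except StopIteration:
            if not self._has_started:
                # The source was empty from the start: propagate StopIteration.
                raise
            # The source produced items before, so after the reset
            # (assumed to restart the same data) it is non-empty.
            self.num_cycles += 1
            self._it = iter(self._items)  # plays the role of source.reset(None)
            return next(self._it)
        self._has_started = True
        return item


cycler = TinyCycler([1, 2])
print([next(cycler) for _ in range(5)])  # [1, 2, 1, 2, 1]
```

Under that assumption, the comment on the retry could say that the source is known to be non-empty rather than that it may raise.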
"""Get the current state of the node.

Returns:
    A dictionary containing the state of the source node and number of cycles completed.
Nit: Can we make it "Dict[str, Any] - A dictionary containing the state of the source node and number of cycles completed."? Ditto for the other nodes in this PR, thanks.
self._num_cycles = initial_state.get(self.NUM_CYCLES_KEY, 0)
self._has_started = initial_state.get(self.HAS_STARTED_KEY, False)
I wonder if we should drop the default values here. If the state is set up incorrectly, the defaults can hide the problem and lead to unexpected behavior.
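One way to read this suggestion is sketched below: index the state directly and fail loudly on a missing key instead of falling back to defaults. The helper `load_cycler_state` is hypothetical and only illustrates the idea; the key names mirror the constants shown in the diff.

```python
from typing import Any, Dict

NUM_CYCLES_KEY = "num_cycles"
HAS_STARTED_KEY = "has_started"


def load_cycler_state(initial_state: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: read required keys without silent defaults."""
    required = (NUM_CYCLES_KEY, HAS_STARTED_KEY)
    missing = [key for key in required if key not in initial_state]
    if missing:
        # A malformed state is reported immediately instead of being
        # papered over with 0 / False.
        raise KeyError(f"initial_state is missing required keys: {missing}")
    return {key: initial_state[key] for key in required}
```

With this shape, a state dict saved by a different node type or an older version fails at restore time rather than silently restarting the cycle count.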
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/data/1461
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure as of commit fbbc425 with merge base 327e225.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
SOURCE_KEY = "source"
NUM_CYCLES_KEY = "num_cycles"
HAS_STARTED_KEY = "has_started"
Let's also include "num_yielded".
def __init__(self, source_node: BaseNode[T], n: int):
    super().__init__()
    if n < 0:
Should we raise a ValueError when n = 0? That is like setting up a node to do nothing, which seems like wrong user input.
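If the stricter check were adopted, the validation could look like the sketch below. `take_first` is a hypothetical stand-in for a head-style node operating on a plain iterable; whether n = 0 should really be rejected is still the open question in this thread.

```python
from itertools import islice
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def take_first(source: Iterable[T], n: int) -> List[T]:
    """Return the first `n` items of `source` (hypothetical helper)."""
    if n <= 0:
        # Stricter than the `n < 0` check in the diff: n == 0 is rejected too.
        raise ValueError(f"n must be a positive integer, got {n}")
    return list(islice(source, n))
```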
self.RNG_STATE_KEY: self.rng.getstate(),
self.BUFFER_KEY: list(self.buffer),
self.NUM_SHUFFLED_KEY: self._num_shuffled,
self.RANDOM_STATE_KEY: self.rng.getstate(),
RANDOM_STATE_KEY is a duplicate of RNG_STATE_KEY; let's remove RANDOM_STATE_KEY.
while len(self.buffer) < self.buffer_size:
    self.buffer.append(next(self.source))
return True
This returns True even when we do not enter the while loop. Is that expected?
Do we want _fill_buffer() to return True if there are elements in the buffer, or only if the call led to elements being added?
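For reference, here is a sketch of the first interpretation, where the return value reports whether the buffer holds anything after the call. The free function `fill_buffer` is a hypothetical rewrite over plain Python objects, and catching StopIteration inside the loop is an assumption that is not visible in the diff fragment.

```python
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def fill_buffer(buffer: List[T], source: Iterator[T], buffer_size: int) -> bool:
    """Top up `buffer` from `source`; return True if the buffer is non-empty afterwards."""
    while len(buffer) < buffer_size:
        try:
            buffer.append(next(source))
        except StopIteration:
            # Source exhausted: stop filling and report what we have.
            break
    # True when the buffer already held items or items were added;
    # False only when the buffer is empty and the source is exhausted.
    return len(buffer) > 0
```

Under the second interpretation the function would instead compare the buffer length before and after the loop, and a call on an already full buffer would return False.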
RNG_STATE_KEY = "rng_state"
BUFFER_KEY = "buffer"
NUM_SHUFFLED_KEY = "num_shuffled"
RANDOM_STATE_KEY = "random_state"
Let's include num_yielded in the state as well.
return {
    self.SOURCE_KEY: self.source.state_dict(),
    self.RNG_STATE_KEY: self.rng.getstate(),
    self.BUFFER_KEY: list(self.buffer),
Should we avoid keeping the entire buffer as state? Since this would be a list of node objects, it might not make sense.
Args:
    source_node (BaseNode[T]): The source node to pull items from.
    buffer_size (int): Size of the buffer used for shuffling. Must be at least 1.
Thinking from a user POV: would they have a strong opinion on what the buffer_size argument should be? Maybe not.
I fear this would end up being set to an arbitrarily large number just to maximize shuffling capacity.
Should we see if we can make it work without buffer_size and seed as arguments? We have something like this in MultiNodeWeightedSampler (link below).
Update: updated link: https://github.com/pytorch/data/blob/main/torchdata/nodes/samplers/multi_node_weighted_sampler.py#L223
From my experience, I believe we should definitely let the user control this. In my use cases, the ideal shuffle buffer size varies a lot per data source.
For example, if a data source is not pre-shuffled globally and is only sharded (which sucks, but it would be a good use case for Shuffler), the buffer size should be as large as the number of items in each shard.
If the buffer size is too large, yes, a lot of other issues can occur. But perhaps that's really up to users, and they should understand how Shuffler works? Especially since I'm not sure there's a non-trivial method to set a value that works for the majority of scenarios I can imagine.
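The dependence on buffer size can be illustrated with a generic streaming shuffle over a plain iterable. This is a sketch of the general technique, not the PR's Shuffler; `buffered_shuffle` and its `seed` default are hypothetical. The point it shows is that an item can only be emitted while it sits in the buffer, so the k-th output (0-indexed) always comes from the first k + buffer_size inputs.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int = 0) -> Iterator[T]:
    """Yield `items` in a locally shuffled order using a bounded buffer."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer: List[T] = []

    def pop_random() -> T:
        # Swap a random element to the end, then pop it in O(1).
        idx = rng.randrange(len(buffer))
        buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
        return buffer.pop()

    for item in items:
        buffer.append(item)
        if len(buffer) >= buffer_size:
            yield pop_random()
    # Source exhausted: drain what is left in random order.
    while buffer:
        yield pop_random()


shuffled = list(buffered_shuffle(range(20), buffer_size=4))
# Every item appears exactly once, but none can jump far ahead of its input position.
assert sorted(shuffled) == list(range(20))
assert all(value <= k + 3 for k, value in enumerate(shuffled))
```

With buffer_size=1 the output order equals the input order, and items from a later shard cannot mix with an earlier one unless the buffer spans the shard boundary, which is why the useful value depends on how the source is laid out.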
I'm not able to open the link. Can you share it from the public code repo?
Following up on discussion #1452 and my previous PR #1454.
Changes
Cycler, Header, Shuffler and their tests.