Rescalability layer #1455
Conversation
Thanks for the work, @daviswer! Some first-level comments:
Thanks @scotts, I made changes 2-4 and am working on unit tests now. I'll note that
A preferred format as we can load document chunks without having to ever pull
the entire document or shard file, allowing for graceful handling of large documents.
Non-standard data format, though.
"""
I wanted to confirm my understanding of the format of the pyarrow shard files. I am imagining a very large text file made up of thousands of tokens. That file is broken into multiple PyArrow shard files. Each of those PyArrow shard files is made up of multiple RecordBatches, each with a tokens field which is a list of tokens. That means each token is a 'row' (in the sense that RecordBatches are supposed to be a batch of records/rows). Is that right?
Additionally, why do we not consider having the list of tokens as a single row in the RecordBatch? What is the value added by using RecordBatches here? Thanks.
The general assumption is that each pyarrow shard file represents a collection of documents, rather than a single large document getting split over multiple files. But yes, each file is a collection of RecordBatches, each of which contains a single 'row' of text (a single document) in the tokens field. We use RecordBatches because that's how the pyarrow docs suggest reading/writing random-access memory-mapped files, and we put a single document per RecordBatch to minimize the overhead of loading individual documents in random order.
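As a minimal sketch of that layout (the file name, token values, and dtype are illustrative, not taken from this PR), writing one document per RecordBatch might look like:

```python
import pyarrow as pa

# Illustrative tokenized documents; each inner list is one document.
docs = [
    [101, 7592, 2088, 102],
    [101, 2178, 2936, 6254, 102],
]

schema = pa.schema([("tokens", pa.uint32())])

# One RecordBatch per document: the "tokens" column holds one token per row.
with pa.OSFile("shard-00000.arrow", "wb") as sink:
    with pa.ipc.new_file(sink, schema) as writer:
        for doc in docs:
            batch = pa.record_batch([pa.array(doc, type=pa.uint32())], schema=schema)
            writer.write_batch(batch)
```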
Cool. Can every document (even large ones) fit in a single RecordBatch?
The docstring mentions "as we can load document chunks without having to ever pull the entire document"; here we are still referring to a single document being loaded in a RecordBatch?
Yes - because pyarrow files are memory-mapped, RecordBatches (and slices) are loaded lazily. Until you request the actual slice, the RecordBatch is just metadata, so it can hold a document of any size without extra overhead.
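A rough sketch of that lazy access pattern (file name and indices are placeholders), assuming a shard written as in the snippet above:

```python
import pyarrow as pa

# Memory-map the shard file: opening it reads only the file footer
# metadata, not the token buffers themselves.
with pa.memory_map("shard-00000.arrow") as source:
    reader = pa.ipc.open_file(source)
    print(reader.num_record_batches)      # number of documents in this shard

    # Fetch one document (one RecordBatch) and slice a chunk of tokens.
    # The slice is zero-copy; bytes are paged in only when materialized.
    doc = reader.get_batch(0)
    chunk = doc.column(0).slice(0, 3)      # first 3 tokens of the "tokens" column
    print(chunk.to_pylist())
```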
Sharing some points that we discussed over the call.
Implements rescaling of checkpoints to different world sizes and numbers of workers. The user specifies the number of data partitions in advance, and when saving/loading checkpoints with a different total number of workers, stateful guarantees are maintained: seen data is not revisited until the next epoch.
Based on the datasets in the corresponding IBM torchtitan PR, but with an adjusted rescaling and iteration mechanism to support greater flexibility and robustness (divisibility constraints on worker and shard counts are removed, and only one open file per physical worker is guaranteed regardless of the number of logical shards). Uses StatefulDataLoader and DCP to manage checkpointing from the master process. An epoch-completion testing script is included for demo purposes. It is possible that the IBM datasets can be merged into the existing torchdata Nodes structure.
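The rescaling logic itself lives in ibm_rescalable.py; purely to illustrate the checkpointing plumbing the PR builds on, a save/restore round trip with StatefulDataLoader and DCP might look roughly like the sketch below. ToyDataset, the checkpoint path, and the step counts are placeholders rather than the PR's API, and a single-process run (or an initialized process group) is assumed.

```python
import torch.distributed.checkpoint as dcp
from torch.utils.data import Dataset
from torchdata.stateful_dataloader import StatefulDataLoader

# Placeholder dataset standing in for the rescalable datasets added in
# ibm_rescalable.py; only the surrounding checkpointing plumbing is shown.
class ToyDataset(Dataset):
    def __len__(self):
        return 10_000
    def __getitem__(self, i):
        return i

loader = StatefulDataLoader(ToyDataset(), batch_size=8, num_workers=2)

# Consume part of an epoch, then snapshot the loader (and, through it,
# the per-worker dataset state).
it = iter(loader)
for _ in range(50):
    next(it)

dcp.save({"dataloader": loader.state_dict()}, checkpoint_id="ckpt/step_50")

# On resume, restore a fresh loader from the checkpoint. Re-partitioning
# across a different number of workers is handled by the PR's rescalable
# datasets, not by StatefulDataLoader itself.
resumed = StatefulDataLoader(ToyDataset(), batch_size=8, num_workers=2)
state = {"dataloader": resumed.state_dict()}
dcp.load(state, checkpoint_id="ckpt/step_50")
resumed.load_state_dict(state["dataloader"])
```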
Changes
torchdata/stateful_dataloader/ibm_rescalable.py
examples/ibm_rescaling/rescaling_demo.py