Rescalability layer #1455

Draft · daviswer wants to merge 76 commits into main

Conversation

daviswer

Implements rescaling of checkpoints to different world sizes and numbers of workers. The user specifies the number of data partitions in advance, and when saving/loading checkpoints with a different total worker count, stateful guarantees are maintained: seen data is not revisited until the next epoch.

Based on the datasets in the corresponding IBM torchtitan PR, but with an adjusted rescaling and iteration mechanism to support greater flexibility and robustness (it removes divisibility constraints from worker and shard counts, and guarantees only one open file per physical worker regardless of the number of logical shards). Uses StatefulDataLoader and DCP to manage checkpointing from the master process. An epoch completion testing script is included for demo purposes. It is possible that the IBM datasets can be merged into the existing torchdata Nodes structure.

Changes

  • Add IBM rescalable datasets and checkpointing functions to torchdata/stateful_dataloader/ibm_rescalable.py
  • Add demo script and correctness check to examples/ibm_rescaling/rescaling_demo.py
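For orientation, a rough sketch of how the pieces are meant to fit together. ScalableReader and the {save, load}_distributed_state_dict helpers are names used in this PR/discussion, but the import, constructor arguments, and call signatures shown here are assumptions for illustration, not the actual API:

```python
# Illustrative sketch only: module path taken from the file listed above, but
# the exported names, arguments, and helper signatures are assumptions.
import torch.distributed as dist
from torchdata.stateful_dataloader import StatefulDataLoader
from torchdata.stateful_dataloader.ibm_rescalable import (  # hypothetical exports
    ScalableReader,
    load_distributed_state_dict,
    save_distributed_state_dict,
)

# The number of logical data partitions is fixed up front and must not change
# across restarts; physical workers (ranks x dataloader workers) may change.
dataset = ScalableReader(
    datapath="/data/corpus",            # hypothetical argument names
    rank=dist.get_rank(),
    worldsize=dist.get_world_size(),
    n_logical_shards=512,
)
loader = StatefulDataLoader(dataset, batch_size=8, num_workers=2)

# Each rank contributes its loader state; DCP writes one resharding-friendly
# checkpoint, driven from the master process.
save_distributed_state_dict(loader, "/ckpt/step_1000")

# On restart, possibly with a different world size and/or num_workers, state
# is redistributed so already-seen data is not revisited until the next epoch.
load_distributed_state_dict(loader, "/ckpt/step_1000")
```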

facebook-github-bot added the CLA Signed label on Feb 25, 2025
@scotts
Contributor

scotts commented Feb 28, 2025

Thanks for the work, @daviswer! Some first-level comments:

  1. Let's create some unit tests based on the example demo. I think it makes sense for them to live in test/stateful_dataloader.
  2. Let's name the Python file after the main abstraction users will use from it, so ibm_rescalable.py should become scalable_reader.py.
  3. The name _WrapperDataset is really generic. From the code and your comments, I think a name closer to the capability it provides might be _NestedStatefulDataset.
  4. The class _ShardFileHandler should probably be an abstract base class. And since we anticipate that others may end up creating their own shard handlers for other formats, we should probably consider a public API, so we should drop the leading _. We might also want to break it out into its own file, such as shard_handler.py. Then all future shard handlers would go in there.
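For illustration, a minimal sketch of what a public abstract base class in shard_handler.py could look like; the method names and signatures here are hypothetical, not the ones in the PR:

```python
from abc import ABC, abstractmethod
from typing import Any, List


class ShardFileHandler(ABC):
    """Reads documents out of a single shard file of one particular format.

    Hypothetical sketch: method names and signatures are illustrative only.
    """

    @abstractmethod
    def is_legal(self, path: str) -> bool:
        """Return True if this handler recognizes the file at `path`."""

    @abstractmethod
    def open(self, path: str) -> Any:
        """Open the shard and return a lightweight, seekable reader handle."""

    @abstractmethod
    def length(self, reader: Any) -> int:
        """Number of documents contained in the shard."""

    @abstractmethod
    def get(self, reader: Any, index: int, offset: int, n_tokens: int) -> List[int]:
        """Return `n_tokens` tokens of document `index`, starting at `offset`."""
```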

@daviswer
Author

daviswer commented Mar 5, 2025

Thanks @scotts, I made changes 2-4 and am working on unit tests now. I'll note that _StatefulDataset and _NestedStatefulDataset largely represent legacy code, gluing things together until we decide whether we want to merge this into Nodes or use these to represent stateful datasets (in which case we'll need to rework them anyway with contracts/APIs/etc. per #1456).

A preferred format as we can load document chunks without having to ever pull
the entire document or shard file, allowing for graceful handling of large documents.
Non-standard data format, though.
"""
Contributor

I wanted to confirm my understanding of the format of the pyarrow shard files. I am imagining a very large text file made up of thousands of tokens. That file is broken into multiple PyArrow shard files. Each of those PyArrow shard files is made up of multiple RecordBatches, each with a tokens field which is a list of tokens. That means each token is a 'row' (in the sense that RecordBatches are supposed to be a batch of records/rows). Is that right?

Additionally, why do we not consider having the full list of tokens as a single row in the RecordBatch? What is the value added by using RecordBatches here? Thanks.

Author

The general assumption is that each pyarrow shard file represents a collection of documents, rather than a single large document being split over multiple files. But yes, each file is a collection of RecordBatches, each of which contains a single 'row' of text (a single document) in the tokens field. We use RecordBatches because that's how the pyarrow docs suggest reading/writing random-access memory-mapped files, and we put a single document per RecordBatch to minimize the overhead of loading individual documents in random order.

Contributor

Cool. Can every document (even large ones) fit in a single RecordBatch?
The docstring mentions "as we can load document chunks without having to ever pull the entire document"; are we still referring here to a single document being loaded in a RecordBatch?

Author

Yes: because pyarrow files are memory-mapped, RecordBatches (and slices) are loaded lazily. Until you request the actual slice, the RecordBatch is just metadata, so it can hold a document of any size without extra overhead.
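To make the lazy loading concrete, here is a minimal read path, assuming shard files written in the Arrow IPC random-access (file) format with one document per RecordBatch and a list-typed tokens column; the filename, batch index, and offsets are made up:

```python
import pyarrow as pa

# Open the shard through a memory map; the OS only pages data in when a
# buffer is actually touched.
with pa.memory_map("shard_00000.arrow", "r") as source:
    reader = pa.ipc.open_file(source)   # random-access IPC file reader
    batch = reader.get_batch(42)        # one document; buffers are mapped, not read
    tokens = batch.column("tokens")     # ListArray holding a single row
    flat = tokens.flatten()             # this document's tokens as a flat array
    chunk = flat.slice(1024, 512)       # zero-copy view of a 512-token chunk
    values = chunk.to_pylist()          # only this slice is materialized in Python
```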

@divyanshk
Contributor

divyanshk commented Mar 14, 2025

Sharing some points that we discussed over the call.

  1. The core work here is in the data access layer, not particularly in the data loader. I imagine we can figure out a way to give end users (think PyTorch users with an established Dataset class) a RescalableDataset wrapper which converts their existing Dataset into one that can be rescaled if they decide to stop and re-start a job with a different world size (a rough sketch follows this list). The ScalableReader is effectively that, although we should consider whether we want to make the user provide more inputs (like a custom file handler) or whether we can configure those inside the rescalable dataset wrapper.

    This can feed directly into a canonical StatefulDataLoader, with the {save, load}_distributed_state_dict functionality incorporated into StatefulDataLoader's state_dict / load_state_dict methods as special cases for RescalableDataset. At this point I don't know how feasible that is (@daviswer brought up a good point about whether we want to take a dependency on DCP, versus having a generic interface that any checkpointing API can plug into), but this seems like a simpler interface for users to onboard to.

  2. So far the implementation is solving for text-heavy AI workloads. We should also align on whether we want to extend the scope to include other styles, e.g. a data row being an arbitrary Dict[str, Any], typical map-style datasets, typical HuggingFace vision datasets, etc.

  3. I need to look at some internal data-access layer APIs to ensure we don't diverge too much.
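A rough sketch of the wrapper idea in point 1, purely for discussion: RescalableDataset is the name used above, while the constructor arguments, state layout, and delegation are assumptions for illustration.

```python
# Hypothetical sketch; not the API proposed or implemented in this PR.
from typing import Any, Dict, Iterator

from torch.utils.data import IterableDataset


class RescalableDataset(IterableDataset):
    """Wraps a user dataset so its progress can be resharded across restarts."""

    def __init__(self, dataset: IterableDataset, n_logical_shards: int = 256):
        self.dataset = dataset
        self.n_logical_shards = n_logical_shards  # fixed in advance across restarts
        self.shard_progress: Dict[int, int] = {}  # logical shard -> items consumed

    def __iter__(self) -> Iterator[Any]:
        # A real implementation would interleave the logical shards assigned to
        # this worker and update self.shard_progress; here we simply delegate.
        yield from self.dataset

    def state_dict(self) -> Dict[str, Any]:
        # Progress is recorded per logical shard, so a load with a different
        # world size can hand unfinished shards to whatever workers exist.
        return {
            "n_logical_shards": self.n_logical_shards,
            "shard_progress": dict(self.shard_progress),
        }

    def load_state_dict(self, state: Dict[str, Any]) -> None:
        assert state["n_logical_shards"] == self.n_logical_shards
        self.shard_progress = dict(state["shard_progress"])
```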

@scotts @daviswer
