Raise error when minibatch is used in SPMD dataloading and per host b… #8205

JackCaoG · 2024-10-02T20:38:39Z

…atch size is not divisible by data mesh

When minibatch is used in the sharding spec, each device loads its own part of the data and the runtime reassembles the parts.

Imagine a case where the mesh is (4,1) and the batch size we use is 3. Each device will see data like

(100,200,300,0)
(400,500,600,0)

and the runtime will reassemble incorrect final data, because the padding ends up in the middle.

If minibatch is not used this is OK, because each device will see

(100, 200, 300,400)
(500, 600, 0, 0)

and we will drop padding at the end.
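To make the difference concrete, here is a minimal pure-Python sketch of the two cases. It does not use torch_xla; the helper `pad_to_multiple` and the reading of the example as two per-host chunks of 3 samples with 4 local devices are assumptions for illustration only.

```python
def pad_to_multiple(samples, multiple, pad_value=0):
    """Append pad_value until len(samples) is a multiple of `multiple`."""
    remainder = len(samples) % multiple
    if remainder == 0:
        return list(samples)
    return list(samples) + [pad_value] * (multiple - remainder)


# Assumed setup: 4 local devices, two chunks of 3 samples each.
local_device_count = 4
host_batches = [[100, 200, 300], [400, 500, 600]]

# Minibatch-style: each chunk is padded on its own, then the chunks
# are concatenated. The padding lands in the interior of the result.
per_host_padded = [pad_to_multiple(b, local_device_count) for b in host_batches]
minibatch_global = [x for chunk in per_host_padded for x in chunk]

# Non-minibatch: the data is concatenated first and padded once,
# so the padding sits at the end and can be dropped safely.
flat = [x for chunk in host_batches for x in chunk]
global_padded = pad_to_multiple(flat, local_device_count)
global_rows = [
    global_padded[i:i + local_device_count]
    for i in range(0, len(global_padded), local_device_count)
]

print(minibatch_global)  # [100, 200, 300, 0, 400, 500, 600, 0]
print(global_rows)       # [[100, 200, 300, 400], [500, 600, 0, 0]]
```

In the first result the zeros are interleaved with real samples, so trimming trailing padding cannot recover the original data.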

torch_xla/core/xla_model.py

JackCaoG · 2024-10-03T00:21:17Z

ok let me resolve the merge conflicts

JackCaoG · 2024-10-03T16:50:48Z

@jonb377 can you take another look?

jonb377

Thanks Jack!

jonb377 · 2024-10-04T19:54:51Z

torch_xla/core/xla_model.py

+        if sharding and tensor.dim() > 0 and (tensor.size()[0] %
+                                              local_runtime_device_count) != 0:
+          raise RuntimeError(
+              "When minibatch is configured, batch dimension of the tensor " +


Maybe clarify the per-host batch size must be divisible...? These concepts are kind of confusing since we're mapping host-level sharding into device-level and representing it as a global tensor, so there's three different batch dimensions to consider.

JackCaoG · Oct 7, 2024

Thanks, let me update in a follow-up PR.
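For reference, the condition in the diff above can be sketched as a standalone function. This is not the code in torch_xla: the function name `check_minibatch_batch_dim` and the error wording (which follows the reviewer's "per-host batch size must be divisible" suggestion) are hypothetical, and a plain shape tuple stands in for the tensor.

```python
def check_minibatch_batch_dim(shape, local_runtime_device_count, minibatch=True):
    """Raise if the per-host batch dimension cannot be split evenly.

    `shape` is the shape of the tensor loaded on this host; only the
    leading (batch) dimension is checked, and scalars are skipped.
    """
    if not minibatch or len(shape) == 0:
        return
    per_host_batch = shape[0]
    if per_host_batch % local_runtime_device_count != 0:
        # Hypothetical message, not the one in the actual change.
        raise RuntimeError(
            "When minibatch is configured, the per-host batch size "
            f"({per_host_batch}) must be divisible by the number of local "
            f"devices ({local_runtime_device_count}).")


# A per-host batch of 8 over 4 local devices passes silently.
check_minibatch_batch_dim((8, 128), 4)

# A per-host batch of 3 over 4 local devices is rejected.
try:
    check_minibatch_batch_dim((3, 128), 4)
except RuntimeError as err:
    print(err)
```

Naming the per-host batch size explicitly in the message is one way to address the ambiguity between the per-host, per-device, and global batch dimensions raised in the review.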

Raise error when minibatch is used in SPMD dataloading and per host b…

54e070f

…atch size is not divisble by data mesh

JackCaoG added the distributed SPMD and other distributed things. label Oct 2, 2024

JackCaoG requested review from jonb377 and khatwanimohit October 2, 2024 20:38

linter

aea22ab

jonb377 reviewed Oct 2, 2024

fix comment

ceb6799

JackCaoG requested a review from jonb377 October 2, 2024 21:17

JackCaoG added the tpuci label Oct 2, 2024

fix test

3bf3be4

JackCaoG added 3 commits October 2, 2024 18:13

Merge branch 'master' into JackCaoG/data_loader_warm_data_size

91ffdc5

Update parallel_loader.py

35a499f

Update parallel_loader.py

de32c01

jonb377 approved these changes Oct 4, 2024

JackCaoG merged commit e3cf356 into master Oct 7, 2024
23 checks passed

JackCaoG deleted the JackCaoG/data_loader_warm_data_size branch October 7, 2024 17:13

dvhg pushed a commit to dvhg/xla that referenced this pull request Oct 7, 2024

Raise error when minibatch is used in SPMD dataloading and per host b… (

da4e540

pytorch#8205)
