Commit 9375899

Fix tutorial about shuffling before sharding (#715)
Summary: Fixes #709
Per title
Pull Request resolved: #715
Reviewed By: NivekT
Differential Revision: D38432061
Pulled By: ejguan
fbshipit-source-id: a8853a86efa9ca7ed6a9e76f0d51470d34513f48
1 parent acbc4b6 commit 9375899

1 file changed: +7 -2 lines changed

docs/source/tutorial.rst

@@ -146,8 +146,7 @@ The reason why ``n_sample = 12`` is because ``ShardingFilter`` (``datapipe.shard
 each worker will independently return all samples. In this case, there are 10 rows per file and 3 files, with a
 batch size of 5, that gives us 6 batches per worker. With 2 workers, we get 12 total batches from the ``DataLoader``.

-In order for DataPipe sharding to work with ``DataLoader``, we need to add the following. It is crucial to add
-``ShardingFilter`` after ``Shuffler`` to ensure that all worker processes have the same order of data for sharding.
+In order for DataPipe sharding to work with ``DataLoader``, we need to add the following.

 .. code:: python

@@ -169,6 +168,12 @@ Note:

 - Place ``ShardingFilter`` (``datapipe.sharding_filter``) as early as possible in the pipeline, especially before expensive
   operations such as decoding, in order to avoid repeating these expensive operations across worker/distributed processes.
+- For a data source that needs to be sharded, it is crucial to add ``Shuffler`` before ``ShardingFilter``
+  to ensure data are globally shuffled before being split into shards. Otherwise, each worker process would
+  always process the same shard of data in every epoch, which means each batch would only consist of data
+  from the same shard, leading to low accuracy during training. However, this does not apply to a data
+  source that has already been sharded for each multi-/distributed process, since ``ShardingFilter`` no
+  longer needs to be present in the pipeline.
 - There may be cases where placing ``Shuffler`` earlier in the pipeline leads to worse performance, because some
   operations (e.g. decompression) are faster with sequential reading. In those cases, we recommend decompressing
   the files prior to shuffling (potentially prior to any data loading).
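
The effect described in the added note can be illustrated with a minimal sketch in plain Python. It does not use torchdata: ``shard``, ``shuffle_then_shard`` and ``shard_then_shuffle`` are hypothetical helpers that stand in for ``ShardingFilter`` and ``Shuffler``. It also assumes round-robin sharding and that every worker shuffles with the same per-epoch seed. The numbers (3 files of 10 rows, batch size 5, 2 workers) are taken from the tutorial text in the first hunk.

```python
import random


def shard(items, num_workers, worker_id):
    """Hypothetical stand-in for ShardingFilter: keep every num_workers-th item."""
    return [x for i, x in enumerate(items) if i % num_workers == worker_id]


def shuffle_then_shard(data, num_workers, worker_id, epoch, seed=0):
    # Assumption: all workers share the same seed per epoch, so they agree on
    # one global order and then each takes a disjoint slice of it.
    order = list(data)
    random.Random(seed + epoch).shuffle(order)
    return shard(order, num_workers, worker_id)


def shard_then_shuffle(data, num_workers, worker_id, epoch, seed=0):
    # Sharding first pins each worker to a fixed subset; shuffling afterwards
    # only reorders samples inside that subset.
    part = shard(data, num_workers, worker_id)
    random.Random(seed + epoch).shuffle(part)
    return part


data = list(range(30))  # 3 files x 10 rows
num_workers, batch_size = 2, 5

# Without sharding, every worker returns all samples.
batches_without_sharding = num_workers * (len(data) // batch_size)
# With sharding, each worker only batches its own share.
batches_with_sharding = sum(
    len(shard(data, num_workers, w)) // batch_size for w in range(num_workers)
)

# Samples seen by worker 0 over three epochs under each ordering.
fixed_sets = [set(shard_then_shuffle(data, num_workers, 0, e)) for e in range(3)]
global_sets = [set(shuffle_then_shard(data, num_workers, 0, e)) for e in range(3)]

print("batches without / with sharding:", batches_without_sharding, batches_with_sharding)
print("shard then shuffle, worker 0:", [sorted(s) for s in fixed_sets])
print("shuffle then shard, worker 0:", [sorted(s) for s in global_sets])
```

In this sketch, sharding before shuffling leaves worker 0 with the same 15 samples in every epoch, while shuffling before sharding lets the membership of each shard change from epoch to epoch and still keeps the shards disjoint.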
