@@ -146,8 +146,7 @@ The reason why ``n_sample = 12`` is because ``ShardingFilter`` (``datapipe.shard
each worker will independently return all samples. In this case, there are 10 rows per file and 3 files, with a
batch size of 5, which gives us 6 batches per worker. With 2 workers, we get 12 total batches from the ``DataLoader``.
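
A minimal sketch of that arithmetic (the ``IterableWrapper`` pipeline below is an assumption standing in
for the tutorial's file-reading pipeline, not its exact code):

.. code:: python

    from torch.utils.data import DataLoader
    from torchdata.datapipes.iter import IterableWrapper

    # Stand-in for 3 files with 10 rows each (30 samples total).
    rows = [f"file{f}_row{r}" for f in range(3) for r in range(10)]
    datapipe = IterableWrapper(rows).batch(5)  # 6 batches over the whole dataset

    # Without a sharding_filter, every worker replays the full pipeline,
    # so each of the 2 workers yields all 6 batches: 12 batches in total.
    dl = DataLoader(datapipe, batch_size=None, num_workers=2)
    print(len(list(dl)))  # 12
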
- In order for DataPipe sharding to work with ``DataLoader``, we need to add the following. It is crucial to add
- ``ShardingFilter`` after ``Shuffler`` to ensure that all worker processes have the same order of data for sharding.
+ In order for DataPipe sharding to work with ``DataLoader``, we need to add the following.
.. code:: python
@@ -169,6 +168,12 @@ Note:
- Place ``ShardingFilter`` (``datapipe.sharding_filter``) as early as possible in the pipeline, especially before expensive
operations such as decoding, in order to avoid repeating these expensive operations across worker/distributed processes.
+ - For a data source that needs to be sharded, it is crucial to add ``Shuffler`` before ``ShardingFilter``
+   to ensure that data are globally shuffled before being split into shards; otherwise, each worker process
+   would always process the same shard of data for every epoch. Each batch would then consist only of data
+   from the same shard, which leads to low accuracy during training (see the sketch after this list). This
+   does not apply to a data source that has already been sharded per worker/distributed process, since
+   ``ShardingFilter`` is no longer required in that pipeline.
- There may be cases where placing ``Shuffler`` earlier in the pipeline leads to worse performance, because some
operations (e.g. decompression) are faster with sequential reading. In those cases, we recommend decompressing
the files prior to shuffling (potentially prior to any data loading).
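
As a rough illustration of the ``Shuffler``-before-``ShardingFilter`` ordering (again using the assumed
``IterableWrapper`` stand-in rather than the tutorial's actual pipeline):

.. code:: python

    from torch.utils.data import DataLoader
    from torchdata.datapipes.iter import IterableWrapper

    rows = [f"file{f}_row{r}" for f in range(3) for r in range(10)]

    datapipe = (
        IterableWrapper(rows)
        .shuffle()           # global shuffle first, so the shards differ from epoch to epoch
        .sharding_filter()   # then each worker keeps only its own share of the samples
        .batch(5)            # expensive per-sample work should also come after sharding
    )

    # DataLoader seeds the Shuffler identically in every worker, so the shards do not
    # overlap: 2 workers split the 30 samples (15 each), yielding 3 batches per worker
    # and 6 batches in total instead of 12.
    dl = DataLoader(datapipe, batch_size=None, num_workers=2, shuffle=True)
    print(len(list(dl)))  # 6
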