Commit acbc4b6

NivekT authored and ejguan committed
Improve note on shuffling behavior in tutorial (#688)
Summary:
Pull Request resolved: #688

Fixes #668

Let me know if the added note is unclear and we can improve upon it.

Test Plan: Imported from OSS

Reviewed By: ejguan

Differential Revision: D38129786

Pulled By: NivekT

fbshipit-source-id: 90ebd43ec448394146bb9136db58d07f0ae74aa4
1 parent a2579ba commit acbc4b6

1 file changed: +4 -1 lines changed

docs/source/tutorial.rst

+4 -1
@@ -147,7 +147,7 @@ each worker will independently return all samples. In this case, there are 10 ro
 batch size of 5, that gives us 6 batches per worker. With 2 workers, we get 12 total batches from the ``DataLoader``.

 In order for DataPipe sharding to work with ``DataLoader``, we need to add the following. It is crucial to add
-`ShardingFilter` after `Shuffler` to ensure that all worker processes have the same order of data for sharding.
+``ShardingFilter`` after ``Shuffler`` to ensure that all worker processes have the same order of data for sharding.

 .. code:: python

@@ -169,6 +169,9 @@ Note:

 - Place ``ShardingFilter`` (``datapipe.sharding_filter``) as early as possible in the pipeline, especially before expensive
 operations such as decoding, in order to avoid repeating these expensive operations across worker/distributed processes.
+- There may be cases where placing ``Shuffler`` earlier in the pipeline leads to worse performance, because some
+operations (e.g. decompression) are faster with sequential reading. In those cases, we recommend decompressing
+the files prior to shuffling (potentially prior to any data loading).


 You can find more DataPipe implementation examples for various research domains `on this page <examples.html>`_.
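
The ordering advice in the notes above can be illustrated without any library. The following is a minimal pure-Python sketch, not torchdata's implementation: ``shard``, ``shuffled`` and ``expensive_decode`` are hypothetical helpers that assume sharding keeps every n-th item per worker, and that a shared seed stands in for all workers seeing the same shuffle order. Under those assumptions, sharding is only a clean partition when every worker shuffles identically first, and sharding before a costly step avoids repeating that step in every worker.

```python
import random


def shard(items, num_workers, worker_id):
    """Keep every num_workers-th item, starting at worker_id (round-robin sharding)."""
    return [x for i, x in enumerate(items) if i % num_workers == worker_id]


def shuffled(items, seed):
    """Return a shuffled copy of items using a dedicated, seeded RNG."""
    out = list(items)
    random.Random(seed).shuffle(out)
    return out


def expensive_decode(x, counter):
    """Stand-in for a costly step such as decoding; counts how often it runs."""
    counter[0] += 1
    return x * 10


data = list(range(10))
num_workers = 2

# Shuffle with a seed shared by all workers, then shard:
# the workers' shards are disjoint and together cover every sample once.
shared = [shard(shuffled(data, seed=0), num_workers, w) for w in range(num_workers)]
seen_shared = sorted(x for part in shared for x in part)

# Shuffle with a different seed per worker, then shard:
# the shards come from different orders, so samples may repeat or go missing.
mixed = [shard(shuffled(data, seed=w), num_workers, w) for w in range(num_workers)]
seen_mixed = sorted(x for part in mixed for x in part)

# Shard before the expensive step versus after it, counting decode calls.
calls_early, calls_late = [0], [0]
for w in range(num_workers):
    for x in shard(data, num_workers, w):
        expensive_decode(x, calls_early)
    shard([expensive_decode(x, calls_late) for x in data], num_workers, w)

print("shared-seed shards cover the data exactly:", seen_shared == data)
print("per-worker-seed shards cover the data exactly:", seen_mixed == data)
print("decode calls (shard early, shard late):", calls_early[0], calls_late[0])
```

With sharding placed early, the decode stand-in runs once per sample in total; with sharding placed late, it runs once per sample in every worker, which is the repeated work the note warns about.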
