@@ -146,8 +146,7 @@ The reason why ``n_sample = 12`` is because ``ShardingFilter`` (``datapipe.shard
each worker will independently return all samples. In this case, there are 10 rows per file and 3 files, with a
batch size of 5, which gives us 6 batches per worker. With 2 workers, we get 12 total batches from the ``DataLoader``.
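
A minimal sketch of that arithmetic (the ``IterableWrapper`` pipeline below is an assumption standing in
for the tutorial's file-reading pipeline, not its exact code):

.. code:: python

    from torch.utils.data import DataLoader
    from torchdata.datapipes.iter import IterableWrapper

    # Stand-in for 3 files with 10 rows each (30 samples total).
    rows = [f"file{f}_row{r}" for f in range(3) for r in range(10)]
    datapipe = IterableWrapper(rows).batch(5)  # 6 batches over the whole dataset

    # Without a sharding_filter, every worker replays the full pipeline,
    # so each of the 2 workers yields all 6 batches: 12 batches in total.
    dl = DataLoader(datapipe, batch_size=None, num_workers=2)
    print(len(list(dl)))  # 12
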
- In order for DataPipe sharding to work with ``DataLoader``, we need to add the following. It is crucial to add
- ``ShardingFilter`` after ``Shuffler`` to ensure that all worker processes have the same order of data for sharding.
+ In order for DataPipe sharding to work with ``DataLoader``, we need to add the following.
.. code:: python
@@ -169,6 +168,12 @@ Note:
- Place ``ShardingFilter`` (``datapipe.sharding_filter``) as early as possible in the pipeline, especially before expensive
operations such as decoding, in order to avoid repeating these expensive operations across worker/distributed processes.
+ - For a data source that needs to be sharded, it is crucial to add ``Shuffler`` before ``ShardingFilter``
+   to ensure that data are globally shuffled before being split into shards; otherwise, each worker process
+   would always process the same shard of data for every epoch. Each batch would then consist only of data
+   from the same shard, which leads to low accuracy during training (see the sketch after this list). This
+   does not apply to a data source that has already been sharded per worker/distributed process, since
+   ``ShardingFilter`` is no longer required in that pipeline.
- There may be cases where placing ``Shuffler`` earlier in the pipeline leads to worse performance, because some
operations (e.g. decompression) are faster with sequential reading. In those cases, we recommend decompressing
the files prior to shuffling (potentially prior to any data loading).
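
As a rough illustration of the ``Shuffler``-before-``ShardingFilter`` ordering (again using the assumed
``IterableWrapper`` stand-in rather than the tutorial's actual pipeline):

.. code:: python

    from torch.utils.data import DataLoader
    from torchdata.datapipes.iter import IterableWrapper

    rows = [f"file{f}_row{r}" for f in range(3) for r in range(10)]

    datapipe = (
        IterableWrapper(rows)
        .shuffle()           # global shuffle first, so the shards differ from epoch to epoch
        .sharding_filter()   # then each worker keeps only its own share of the samples
        .batch(5)            # expensive per-sample work should also come after sharding
    )

    # DataLoader seeds the Shuffler identically in every worker, so the shards do not
    # overlap: 2 workers split the 30 samples (15 each), yielding 3 batches per worker
    # and 6 batches in total instead of 12.
    dl = DataLoader(datapipe, batch_size=None, num_workers=2, shuffle=True)
    print(len(list(dl)))  # 6
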