1 file changed: +21 -0 lines changed
@@ -753,6 +753,27 @@ def partition_dataset(
    And it can split the dataset based on specified ratios or evenly split into `num_partitions`.
    Refer to: https://github.com/pytorch/pytorch/blob/master/torch/utils/data/distributed.py.

+    Note:
+        It can also be used to partition a dataset for the ranks in distributed training.
+        For example, partition the dataset before training and use `CacheDataset`, so every rank trains with its own data.
+        This avoids caching duplicated content on each rank, but does not perform a global shuffle before every epoch:
+
+        .. code-block:: python
+
+            data_partition = partition_dataset(
+                data=train_files,
+                num_partitions=dist.get_world_size(),
+                shuffle=True,
+                even_divisible=True,
+            )[dist.get_rank()]
+
+            train_ds = SmartCacheDataset(
+                data=data_partition,
+                transform=train_transforms,
+                replace_rate=0.2,
+                cache_num=15,
+            )
+
    Args:
        data: input dataset to split, expect a list of data.
        ratios: a list of ratio numbers to split the dataset, like [8, 1, 1].
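The ratio-based split mentioned in the docstring (e.g. `ratios=[8, 1, 1]` for an 80/10/10 split) can be pictured with a small standalone sketch. This is plain Python for illustration only, not the library's actual implementation; `split_by_ratios` is a hypothetical helper:

```python
def split_by_ratios(data, ratios):
    """Split a list into len(ratios) consecutive partitions,
    sized in proportion to the given ratio numbers."""
    total = sum(ratios)
    n = len(data)
    parts = []
    start = 0
    acc = 0
    for r in ratios:
        acc += r
        # Cumulative boundary, rounded so the partitions cover all items.
        end = round(n * acc / total)
        parts.append(data[start:end])
        start = end
    return parts

parts = split_by_ratios(list(range(10)), [8, 1, 1])
# parts -> [[0, 1, 2, 3, 4, 5, 6, 7], [8], [9]]
```

Using cumulative boundaries (rather than rounding each partition size independently) guarantees every item lands in exactly one partition, even when the ratios do not divide the dataset evenly.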