Adds gpu minhash support for RayBTSMinhashDeduplicator #644

Merged

Conversation

cyruszhang (Collaborator)

No description provided.

@cyruszhang cyruszhang requested review from chenyushuo and yxdyc April 17, 2025 20:32
@cyruszhang cyruszhang changed the title from "Adds gpu minhash support for RayBTSMinhashDeduplicator" to "[WIP] Adds gpu minhash support for RayBTSMinhashDeduplicator" Apr 17, 2025
@cyruszhang cyruszhang marked this pull request as draft April 17, 2025 20:43
@cyruszhang cyruszhang changed the base branch from main to feat/ayushdg/gpu-minhash-poc April 17, 2025 20:45
@HYLcool HYLcool requested a review from pan-x-c April 22, 2025 01:12
@HYLcool HYLcool added the enhancement, dj:op, and dj:efficiency labels Apr 22, 2025
@pan-x-c pan-x-c left a comment (Collaborator)

Please see the inline comments; the rest LGTM.

@@ -80,6 +81,9 @@ def __init__(
self.max_pending_edge_buffer_task = max_pending_edge_buffer_task
self.num_edge_buffer_task_returns = num_edge_buffer_task_returns

def get_hash_table(self):
Collaborator: This method can be removed.

@@ -396,7 +451,7 @@ def tokenization_func(text):
gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
) for _ in range(self.num_permutation)],
- dtype=np.uint64,
+ dtype=np.uint32,
Collaborator: This may break some constraints; it's better to keep uint64.

gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
) for _ in range(256)],
dtype=np.uint32,
Collaborator: Similar issue; keep uint64 here as well.
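
For context on the dtype concern: a minimal sketch, assuming MERSENNE_PRIME follows the usual MinHash convention of 2**61 - 1 (not verified against this file), showing why values drawn from that range cannot be stored in uint32 without silent truncation.

import numpy as np

MERSENNE_PRIME = np.uint64((1 << 61) - 1)  # assumed value, common MinHash convention
gen = np.random.RandomState(seed=42)

# Permutation parameters generated as in the hunk above, kept as uint64.
perm = np.array(
    [(
        gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
        gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
    ) for _ in range(4)],
    dtype=np.uint64,
)

# Narrowing to uint32 keeps only the low 32 bits of each value without raising,
# so the permutations no longer cover the intended [0, MERSENNE_PRIME) range.
perm_32 = perm.astype(np.uint32)
print(perm[0])     # values on the order of 2**60
print(perm_32[0])  # the same values modulo 2**32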

concurrency=3,
batch_size=self.minhash_batch_size,
)
dataset.map_batches(
Collaborator: Is there any way to merge these two map_batches into one? Adding an additional map_batches call may increase network overhead.

Contributor: We should be able to combine the two map_batches into a single call by moving all the banding logic into the GPU minhash actor, but since the number of GPUs (and hence the concurrency) available for this stage may be much lower than the total number of CPUs, it could reduce the parallelism of the banding step. It's a tradeoff between networking overhead (via the object store) and fewer actors doing the banding; I'm not sure which is more optimal.
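
To make the tradeoff concrete, here is a rough, self-contained sketch of the two layouts being discussed; it is not the PR's code, and compute_minhash, band_signatures, and the toy dataset are placeholders.

import numpy as np
import pyarrow as pa
import ray

def compute_minhash(batch: pa.Table) -> pa.Table:
    # Stand-in for the GPU minhash stage (the real one runs on cuDF with num_gpus=1).
    sigs = [np.random.randint(0, 2**32, size=8, dtype=np.uint64).tolist()
            for _ in range(batch.num_rows)]
    return batch.append_column('minhash', pa.array(sigs))

def band_signatures(batch: pa.Table) -> pa.Table:
    # Stand-in for CPU-side LSH banding over the signature column.
    bands = [sig[:2] for sig in batch.column('minhash').to_pylist()]
    return batch.append_column('band_0', pa.array(bands))

ds = ray.data.from_items([{'text': f'doc {i}'} for i in range(100)])

# Two stages: the extra map_batches ships signatures through the object store,
# but banding can fan out across all CPU workers.
two_stage = (
    ds.map_batches(compute_minhash, batch_format='pyarrow', zero_copy_batch=True)
      .map_batches(band_signatures, batch_format='pyarrow', zero_copy_batch=True)
)

# Merged alternative: no intermediate transfer, but banding parallelism is then
# capped at the (usually small) number of GPU actors.
def minhash_and_band(batch: pa.Table) -> pa.Table:
    return band_signatures(compute_minhash(batch))

merged = ds.map_batches(minhash_and_band, batch_format='pyarrow', zero_copy_batch=True)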

batch_format='pyarrow',
zero_copy_batch=True,
num_gpus=1,
concurrency=3,
Contributor: I artificially set this during testing, but we would want it to be configurable. I'm not sure what the best approach/config for this is.

Collaborator: Adding a gpu_actor_concurrency parameter to the __init__ method would be okay.
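
A hedged sketch of that suggestion; the class skeleton below is illustrative only (GPUMinHashActor is stubbed out, and the defaults are assumptions, not the merged implementation).

import pyarrow as pa

class GPUMinHashActor:
    # Placeholder stub for the actor class added in this PR; the real one
    # computes minhash signatures on GPU via cuDF.
    def __call__(self, batch: pa.Table) -> pa.Table:
        return batch

class RayBTSMinhashDeduplicator:
    def __init__(self, *args, gpu_actor_concurrency: int = 3,
                 minhash_batch_size: int = 10_000, **kwargs):
        # Number of concurrent GPU minhash actors used by map_batches,
        # replacing the hard-coded concurrency=3 above.
        self.gpu_actor_concurrency = gpu_actor_concurrency
        self.minhash_batch_size = minhash_batch_size

    def compute_minhash_gpu(self, dataset):
        # Hypothetical helper showing where the parameter would be consumed.
        return dataset.map_batches(
            GPUMinHashActor,
            batch_format='pyarrow',
            zero_copy_batch=True,
            num_gpus=1,
            concurrency=self.gpu_actor_concurrency,
            batch_size=self.minhash_batch_size,
        )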

@cyruszhang cyruszhang changed the title from "[WIP] Adds gpu minhash support for RayBTSMinhashDeduplicator" to "Adds gpu minhash support for RayBTSMinhashDeduplicator" May 28, 2025
@yxdyc yxdyc left a comment (Collaborator)

LGTM. Further improvements will be implemented in the dev branch.

@cyruszhang cyruszhang marked this pull request as ready for review May 29, 2025 15:37
@cyruszhang cyruszhang merged commit af55d56 into modelscope:feat/ayushdg/gpu-minhash-poc May 29, 2025
0 of 2 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in data-juicer May 29, 2025
HYLcool pushed a commit that referenced this pull request Jun 25, 2025
* Initial PoC PR that adds gpu minhash support for some cases (#644)

Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>

* add test util

* add cudf; update uv.lock; add dedup-ray-bts.yaml

* add param docstring; use uint64 for GPU operations

* separate configs for CPU/GPU

* fix pre-commit errors

* Lazy Remote Class Registration for GPUMinHashActor

* add utility for ray cluster resource checking

* add head_node_participates logic

* use_cuda instead of use_gpu extra param

* use Actor directly; use proper config and monitoring

* fix pre-commit issue: extra white line

* use available GPU for cuda minhash calculation

* use batch_size per available cluster GPU memory

* tune up max batch size

* remove temporary test util

* update param doc for minhash_batch_size

* remove redundant entry in .gitignore

* update cudf dependency group

* fix broken pyproject.toml merge

* update uv.lock; use tsinghua mirror instead of aliyun, in accordance with dockerfile

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Labels: dj:efficiency (regarding efficiency issues and enhancements), dj:op (issues/PRs about some specific OPs), enhancement (New feature or request)

5 participants