Adds gpu minhash support for RayBTSMinhashDeduplicator #644

Merged

Conversation

cyruszhang (Collaborator)

No description provided.

@cyruszhang cyruszhang requested review from chenyushuo and yxdyc April 17, 2025 20:32
@cyruszhang cyruszhang changed the title from "Adds gpu minhash support for RayBTSMinhashDeduplicator" to "[WIP] Adds gpu minhash support for RayBTSMinhashDeduplicator" Apr 17, 2025
@cyruszhang cyruszhang marked this pull request as draft April 17, 2025 20:43
@cyruszhang cyruszhang changed the base branch from main to feat/ayushdg/gpu-minhash-poc April 17, 2025 20:45
@HYLcool HYLcool requested a review from pan-x-c April 22, 2025 01:12
@HYLcool HYLcool added the enhancement, dj:op, and dj:efficiency labels Apr 22, 2025
@pan-x-c pan-x-c left a comment (Collaborator)

Please see the inline comments; the rest LGTM.

@@ -80,6 +81,9 @@ def __init__(
self.max_pending_edge_buffer_task = max_pending_edge_buffer_task
self.num_edge_buffer_task_returns = num_edge_buffer_task_returns

def get_hash_table(self):
Collaborator: This method can be removed.

@@ -396,7 +451,7 @@ def tokenization_func(text):
gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
) for _ in range(self.num_permutation)],
- dtype=np.uint64,
+ dtype=np.uint32,
Collaborator: This may break some constraints; it's better to keep uint64.

gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
) for _ in range(256)],
dtype=np.uint32,
Collaborator: Similar issue; keep uint64 here as well.
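
For context on the dtype concern: a minimal sketch, assuming MERSENNE_PRIME follows the usual MinHash convention of 2**61 - 1 (not verified against this file), showing why values drawn from that range cannot be stored in uint32 without silent truncation.

import numpy as np

MERSENNE_PRIME = np.uint64((1 << 61) - 1)  # assumed value, common MinHash convention
gen = np.random.RandomState(seed=42)

# Permutation parameters generated as in the hunk above, kept as uint64.
perm = np.array(
    [(
        gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
        gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
    ) for _ in range(4)],
    dtype=np.uint64,
)

# Narrowing to uint32 keeps only the low 32 bits of each value without raising,
# so the permutations no longer cover the intended [0, MERSENNE_PRIME) range.
perm_32 = perm.astype(np.uint32)
print(perm[0])     # values on the order of 2**60
print(perm_32[0])  # the same values modulo 2**32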

concurrency=3,
batch_size=self.minhash_batch_size,
)
dataset.map_batches(
Collaborator: Is there any way to merge these two map_batches into one? Adding an additional map_batches call may increase network overhead.

Contributor: We should be able to combine the two map_batches into a single call by moving all the banding logic into the GPU minhash actor, but since the number of GPUs (and hence the concurrency) available for this stage may be much lower than the total number of CPUs, it could reduce the parallelism of the banding step. It's a tradeoff between networking overhead (via the object store) and fewer actors doing the banding; I'm not sure which is more optimal.
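
To make the tradeoff concrete, here is a rough, self-contained sketch of the two layouts being discussed; it is not the PR's code, and compute_minhash, band_signatures, and the toy dataset are placeholders.

import numpy as np
import pyarrow as pa
import ray

def compute_minhash(batch: pa.Table) -> pa.Table:
    # Stand-in for the GPU minhash stage (the real one runs on cuDF with num_gpus=1).
    sigs = [np.random.randint(0, 2**32, size=8, dtype=np.uint64).tolist()
            for _ in range(batch.num_rows)]
    return batch.append_column('minhash', pa.array(sigs))

def band_signatures(batch: pa.Table) -> pa.Table:
    # Stand-in for CPU-side LSH banding over the signature column.
    bands = [sig[:2] for sig in batch.column('minhash').to_pylist()]
    return batch.append_column('band_0', pa.array(bands))

ds = ray.data.from_items([{'text': f'doc {i}'} for i in range(100)])

# Two stages: the extra map_batches ships signatures through the object store,
# but banding can fan out across all CPU workers.
two_stage = (
    ds.map_batches(compute_minhash, batch_format='pyarrow', zero_copy_batch=True)
      .map_batches(band_signatures, batch_format='pyarrow', zero_copy_batch=True)
)

# Merged alternative: no intermediate transfer, but banding parallelism is then
# capped at the (usually small) number of GPU actors.
def minhash_and_band(batch: pa.Table) -> pa.Table:
    return band_signatures(compute_minhash(batch))

merged = ds.map_batches(minhash_and_band, batch_format='pyarrow', zero_copy_batch=True)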

batch_format='pyarrow',
zero_copy_batch=True,
num_gpus=1,
concurrency=3,
Contributor: I artificially set this during testing, but we would want it to be configurable. I'm not sure what the best approach/config for this is.

Collaborator: Adding a gpu_actor_concurrency parameter to the __init__ method would be okay.
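
A hedged sketch of that suggestion; the class skeleton below is illustrative only (GPUMinHashActor is stubbed out, and the defaults are assumptions, not the merged implementation).

import pyarrow as pa

class GPUMinHashActor:
    # Placeholder stub for the actor class added in this PR; the real one
    # computes minhash signatures on GPU via cuDF.
    def __call__(self, batch: pa.Table) -> pa.Table:
        return batch

class RayBTSMinhashDeduplicator:
    def __init__(self, *args, gpu_actor_concurrency: int = 3,
                 minhash_batch_size: int = 10_000, **kwargs):
        # Number of concurrent GPU minhash actors used by map_batches,
        # replacing the hard-coded concurrency=3 above.
        self.gpu_actor_concurrency = gpu_actor_concurrency
        self.minhash_batch_size = minhash_batch_size

    def compute_minhash_gpu(self, dataset):
        # Hypothetical helper showing where the parameter would be consumed.
        return dataset.map_batches(
            GPUMinHashActor,
            batch_format='pyarrow',
            zero_copy_batch=True,
            num_gpus=1,
            concurrency=self.gpu_actor_concurrency,
            batch_size=self.minhash_batch_size,
        )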

@cyruszhang cyruszhang changed the title from "[WIP] Adds gpu minhash support for RayBTSMinhashDeduplicator" to "Adds gpu minhash support for RayBTSMinhashDeduplicator" May 28, 2025
@yxdyc yxdyc left a comment (Collaborator)

LGTM. Further improvements will be implemented in the dev branch.

@cyruszhang cyruszhang marked this pull request as ready for review May 29, 2025 15:37
@cyruszhang cyruszhang merged commit af55d56 into modelscope:feat/ayushdg/gpu-minhash-poc May 29, 2025
0 of 2 checks passed
@github-project-automation github-project-automation bot moved this from Todo to Done in data-juicer May 29, 2025
HYLcool pushed a commit that referenced this pull request Jun 25, 2025
* Initial PoC PR that adds gpu minhash support for some cases (#644)

Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>

* add test util

* add cudf; update uv.lock; add dedup-ray-bts.yaml

* add param docstring; use uint64 for GPU operations

* separate configs for CPU/GPU

* fix pre-commit errors

* Lazy Remote Class Registration for GPUMinHashActor

* add utility for ray cluster resource checking

* add head_node_participates logic

* use_cuda instead of use_gpu extra param

* use Actor directly; use proper config and monitoring

* fix pre-commit issue: extra white line

* use available GPU for cuda minhash calculation

* use batch_size per available cluster GPU memory

* tune up max batch size

* remove temporary test util

* update param doc for minhash_batch_size

* remove redundant entry in .gitignore

* update cudf dependency group

* fix broken pyproject.toml merge

* update uv.lock; use tsinghua mirror instead of aliyun, in accordance with dockerfile

---------

Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>
Labels: dj:efficiency (regarding efficiency issues and enhancements), dj:op (issues/PRs about some specific OPs), enhancement (New feature or request)

5 participants