
[WIP] Adds gpu minhash support for RayBTSMinhashDeduplicator #644


Draft
wants to merge 1 commit into
base: feat/ayushdg/gpu-minhash-poc

Conversation

cyruszhang (Collaborator)

No description provided.

@cyruszhang requested review from chenyushuo and yxdyc on April 17, 2025 20:32
@cyruszhang changed the title from "Adds gpu minhash support for RayBTSMinhashDeduplicator" to "[WIP] Adds gpu minhash support for RayBTSMinhashDeduplicator" on April 17, 2025
@cyruszhang marked this pull request as draft on April 17, 2025 20:43
@cyruszhang changed the base branch from main to feat/ayushdg/gpu-minhash-poc on April 17, 2025 20:45
@HYLcool requested a review from pan-x-c on April 22, 2025 01:12
@HYLcool added the enhancement, dj:op, and dj:efficiency labels on April 22, 2025
@pan-x-c (Collaborator) left a comment:

Please see the inline comments; the rest LGTM.

@@ -80,6 +81,9 @@ def __init__(
self.max_pending_edge_buffer_task = max_pending_edge_buffer_task
self.num_edge_buffer_task_returns = num_edge_buffer_task_returns

def get_hash_table(self):
@pan-x-c (Collaborator): This method can be removed.

@@ -396,7 +451,7 @@ def tokenization_func(text):
gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
) for _ in range(self.num_permutation)],
dtype=np.uint64,
dtype=np.uint32,
@pan-x-c (Collaborator): This may break some constraints; it's better to keep uint64.
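
For reference, a minimal sketch of the concern (assuming MERSENNE_PRIME is the usual 2**61 - 1 and the table is built from a seeded np.random.RandomState, as in the existing CPU path), showing what the uint32 cast does to the permutation coefficients:

```python
import numpy as np

# Assumption: same constant as the CPU minhash path.
MERSENNE_PRIME = np.uint64((1 << 61) - 1)

gen = np.random.RandomState(seed=42)
perm = np.array(
    [(
        gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
        gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
    ) for _ in range(4)],
    dtype=np.uint64,
)

# The coefficients are drawn from [0, 2**61 - 1), so most of them do not fit in
# 32 bits; a uint32 table keeps only the low 32 bits and silently changes the
# (a * h + b) % MERSENNE_PRIME hash family.
print(perm.max() > np.iinfo(np.uint32).max)  # True with overwhelming probability
print(perm[0], perm.astype(np.uint32)[0])    # truncated coefficients differ
```

Since the coefficients routinely exceed 2**32, a uint32 table no longer represents the same permutations, which appears to be the constraint the review is pointing at.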

gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
) for _ in range(256)],
dtype=np.uint32,
@pan-x-c (Collaborator): Similar issue.

concurrency=3,
batch_size=self.minhash_batch_size,
)
dataset.map_batches(
@pan-x-c (Collaborator): Is there any way to merge these two map_batches calls into one? An additional map_batches stage may increase network overhead.

Reply: We should be able to combine the two map_batches calls into one by moving all the banding logic into the GPU minhash actor. However, since the number of GPUs / the concurrency level for this stage might be much lower than the total number of CPUs available, doing so could reduce the concurrency of the banding step. It is a tradeoff between networking overhead (via the object store) and having fewer actors perform the banding; I'm not sure which is more optimal.
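
A hedged sketch of the fused alternative under discussion: a single map_batches stage whose actor computes the signatures and immediately performs the banding, so signatures never round-trip through the object store. The class name, column names, and the toy signature function are illustrative only; in the real operator the GPU minhash kernel would run where _toy_signatures is, with num_gpus=1 on the map_batches call (omitted here so the sketch runs on a CPU-only machine).

```python
import numpy as np
import pyarrow as pa
import ray


class MinHashWithBanding:
    """Illustrative fused minhash + banding actor for a single map_batches stage."""

    def __init__(self, num_permutation: int = 16, num_bands: int = 4):
        self.num_permutation = num_permutation
        self.num_bands = num_bands
        self.rows_per_band = num_permutation // num_bands

    def _toy_signatures(self, num_docs: int) -> np.ndarray:
        # Placeholder for the GPU minhash kernel: one signature row per document.
        rng = np.random.default_rng(0)
        return rng.integers(0, 2**32, size=(num_docs, self.num_permutation),
                            dtype=np.uint64)

    def __call__(self, batch: pa.Table) -> pa.Table:
        sig = self._toy_signatures(batch.num_rows)
        # Banding happens on the same actor, right after the signatures are
        # computed, instead of in a second map_batches stage.
        band_hashes = [
            [hash(tuple(row[b * self.rows_per_band:(b + 1) * self.rows_per_band].tolist()))
             for b in range(self.num_bands)]
            for row in sig
        ]
        return batch.append_column('band_hashes', pa.array(band_hashes))


ds = ray.data.from_items([{'text': f'doc {i}'} for i in range(8)])
fused = ds.map_batches(
    MinHashWithBanding,
    batch_format='pyarrow',
    zero_copy_batch=True,
    concurrency=2,   # would be the GPU actor count in the real operator
    batch_size=4,
)
print(fused.take(2))
```

The tradeoff noted above still applies: with the banding pinned to the (fewer) GPU actors, the second stage's CPU parallelism is lost in exchange for avoiding the extra object-store hop.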

batch_format='pyarrow',
zero_copy_batch=True,
num_gpus=1,
concurrency=3,
I artificially set this during testing, but we would want it to be configurable. I'm not sure what the best approach/config for this is.

@pan-x-c (Collaborator): Adding a gpu_actor_concurrency parameter to the __init__ method would be fine.
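
A minimal sketch of that suggestion, assuming the new argument is threaded straight through to the map_batches call; only the gpu_actor_concurrency name comes from the comment above, the class and method names are illustrative:

```python
class GPUMinHashStage:
    """Illustrative stand-in for the operator that owns the GPU minhash map_batches call."""

    def __init__(self, minhash_batch_size: int = 1024,
                 gpu_actor_concurrency: int = 3):
        # Default to 3 so the behaviour used during testing is preserved unless
        # the operator config overrides it.
        self.minhash_batch_size = minhash_batch_size
        self.gpu_actor_concurrency = gpu_actor_concurrency

    def compute_minhash(self, dataset, minhash_actor_cls):
        # Same call as in the diff, with the literal concurrency=3 replaced by
        # the configurable attribute.
        return dataset.map_batches(
            minhash_actor_cls,
            batch_format='pyarrow',
            zero_copy_batch=True,
            num_gpus=1,
            concurrency=self.gpu_actor_concurrency,
            batch_size=self.minhash_batch_size,
        )
```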

Labels
dj:efficiency (regarding efficiency issues and enhancements), dj:op (issues/PRs about some specific OPs), enhancement (new feature or request)
4 participants