[WIP] Adds gpu minhash support for RayBTSMinhashDeduplicator #644
base: feat/ayushdg/gpu-minhash-poc
Conversation
Signed-off-by: Ayush Dattagupta <[email protected]>
Please see the inline comments; the rest LGTM.
@@ -80,6 +81,9 @@ def __init__(
        self.max_pending_edge_buffer_task = max_pending_edge_buffer_task
        self.num_edge_buffer_task_returns = num_edge_buffer_task_returns

+    def get_hash_table(self):
This method can be removed
@@ -396,7 +451,7 @@ def tokenization_func(text):
                gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
                gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
            ) for _ in range(self.num_permutation)],
-            dtype=np.uint64,
+            dtype=np.uint32,
This may break some constraints; it's better to keep uint64.
                gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
                gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
            ) for _ in range(256)],
            dtype=np.uint32,
Similar issue
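To make the uint64 concern concrete, here is a minimal sketch (not from the PR) that assumes MERSENNE_PRIME = 2**61 - 1, as in the existing CPU minhash path: the prime does not fit in 32 bits, so drawing the permutation parameters with dtype=np.uint32 fails outright, and down-casting 64-bit parameters would silently truncate them.

```python
# Illustrative only; the MERSENNE_PRIME value is assumed to match the CPU path.
import numpy as np

MERSENNE_PRIME = np.uint64((1 << 61) - 1)
gen = np.random.RandomState(seed=42)

# Fine: parameters drawn and stored as 64-bit integers.
a64 = gen.randint(1, MERSENNE_PRIME, dtype=np.uint64)

# Fails: 2**61 - 1 exceeds the uint32 upper bound.
try:
    gen.randint(1, MERSENNE_PRIME, dtype=np.uint32)
except ValueError as err:
    print(err)  # high is out of bounds for uint32

# Casting after the fact does not help either: the value is truncated,
# which changes the hash family and breaks compatibility with uint64 runs.
print(np.uint32(a64) == a64)  # almost certainly False
```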
            concurrency=3,
            batch_size=self.minhash_batch_size,
        )
        dataset.map_batches(
Is there any way to merge these two map_batches into one? Adding additional map_batches may increase network overhead.
We should be able to combine the two map_batches calls into a single one by moving all the banding logic into the GPU minhash actor, but since the number of GPUs (and hence the concurrency level) for this stage might be much lower than the total CPUs available, it could reduce the concurrency for banding. It's a tradeoff between networking overhead (via the object store) and fewer actors doing the banding; I'm not sure which is more optimal.
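For reference, a rough sketch (not the PR's code) of the two layouts being discussed; GPUMinHash, band_batch, and GPUMinHashAndBand are placeholder stand-ins for the real cuDF kernel and banding logic, and running it as written would need a Ray cluster with GPUs available.

```python
# Sketch only: pass-through stubs stand in for the real GPU minhash kernel
# and the CPU banding step.
import pyarrow as pa
import ray


class GPUMinHash:
    # Placeholder for the GPU minhash actor (real version would use cuDF).
    def __call__(self, batch: pa.Table) -> pa.Table:
        return batch


def band_batch(batch: pa.Table) -> pa.Table:
    # Placeholder for splitting signatures into LSH bands on the CPU.
    return batch


class GPUMinHashAndBand(GPUMinHash):
    # Placeholder combined actor: banding runs inside the GPU actor.
    def __call__(self, batch: pa.Table) -> pa.Table:
        return band_batch(super().__call__(batch))


ds = ray.data.range(1_000)  # placeholder dataset

# Option A (current layout): two stages. Extra hop through the object store,
# but banding parallelism is not capped by the number of GPU actors.
out_a = ds.map_batches(
    GPUMinHash,
    batch_format="pyarrow",
    zero_copy_batch=True,
    num_gpus=1,
    concurrency=3,
).map_batches(band_batch, batch_format="pyarrow")

# Option B (merged): one stage. Fewer object-store transfers, but banding
# concurrency is limited to the GPU actor concurrency.
out_b = ds.map_batches(
    GPUMinHashAndBand,
    batch_format="pyarrow",
    zero_copy_batch=True,
    num_gpus=1,
    concurrency=3,
)
```

Which option wins likely depends on the GPU-to-CPU ratio and batch size, so benchmarking both on a representative dataset seems like the safest way to decide.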
            batch_format='pyarrow',
            zero_copy_batch=True,
            num_gpus=1,
            concurrency=3,
I artificially set this during testing, but we would want it to be configurable. I'm not sure what the best approach/config for this is.
Adding a gpu_actor_concurrency parameter to the __init__ method is okay.
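A minimal sketch of that suggestion (illustrative only; GpuMinhashStage and its defaults are placeholders, not the actual RayBTSMinhashDeduplicator): the knob is stored in __init__ and forwarded to map_batches in place of the hard-coded concurrency=3.

```python
# Illustrative skeleton only, not the real deduplicator class.
class GpuMinhashStage:
    def __init__(self, minhash_batch_size: int = 10_000,
                 gpu_actor_concurrency: int = 3):
        # gpu_actor_concurrency replaces the hard-coded concurrency=3
        # that was set artificially during testing.
        self.minhash_batch_size = minhash_batch_size
        self.gpu_actor_concurrency = gpu_actor_concurrency

    def compute_minhash(self, dataset, gpu_actor_cls):
        # gpu_actor_cls is the callable class wrapping the GPU minhash kernel.
        return dataset.map_batches(
            gpu_actor_cls,
            batch_format="pyarrow",
            zero_copy_batch=True,
            num_gpus=1,
            concurrency=self.gpu_actor_concurrency,
            batch_size=self.minhash_batch_size,
        )
```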
No description provided.