Adds gpu minhash support for RayBTSMinhashDeduplicator #644
Conversation
Signed-off-by: Ayush Dattagupta <[email protected]>
Please see the inline comments; the rest LGTM.
```diff
@@ -80,6 +81,9 @@ def __init__(
         self.max_pending_edge_buffer_task = max_pending_edge_buffer_task
         self.num_edge_buffer_task_returns = num_edge_buffer_task_returns

+    def get_hash_table(self):
```
This method can be removed.
```diff
@@ -396,7 +451,7 @@ def tokenization_func(text):
             gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
             gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
         ) for _ in range(self.num_permutation)],
-        dtype=np.uint64,
+        dtype=np.uint32,
```
This may break some constraints; it's better to keep uint64.
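To illustrate the concern: the permutation coefficients are drawn modulo the Mersenne prime 2^61 − 1, which does not fit in 32 bits, so a `uint32` dtype silently truncates them. A minimal sketch (the variable names here are illustrative, not the PR's code):

```python
import numpy as np

# MERSENNE_PRIME = 2**61 - 1 is the modulus used when drawing the minhash
# permutation coefficients; it does not fit in 32 bits, so storing the
# coefficients as uint32 silently truncates them and changes the hash family.
MERSENNE_PRIME = np.uint64((1 << 61) - 1)

coeff = MERSENNE_PRIME - np.uint64(1)   # a valid coefficient near the modulus
truncated = coeff.astype(np.uint32)     # keeps only the low 32 bits

print(int(coeff))      # 2305843009213693950
print(int(truncated))  # 4294967294
```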
```diff
             gen.randint(1, MERSENNE_PRIME, dtype=np.uint64),
             gen.randint(0, MERSENNE_PRIME, dtype=np.uint64),
         ) for _ in range(256)],
-        dtype=np.uint32,
```
Similar issue here.
```python
    concurrency=3,
    batch_size=self.minhash_batch_size,
)
dataset.map_batches(
```
Is there any way to merge these two `map_batches` calls into one? Adding an additional `map_batches` may increase network overhead.
We should be able to combine the two `map_batches` calls into a single one by moving all the banding logic into the GPU minhash actor. But since the number of GPUs (i.e. the concurrency level) for this stage might be much smaller than the total CPUs available, it might reduce the concurrency for banding. It's a tradeoff between networking overhead (via the object store) and fewer actors doing the banding; I'm not sure which is more optimal.
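The fused version being discussed can be sketched as a single batch function that computes the minhash signatures and immediately bands them, so the intermediate signatures never travel back through the object store. This is an illustrative sketch only: `compute_minhash`, `band_keys`, and `fused_batch_fn` are hypothetical stand-ins, not the PR's actual GPU actor.

```python
import numpy as np

def compute_minhash(texts, num_perm=8):
    # stand-in for the GPU minhash kernel: one signature row per input text
    sigs = np.empty((len(texts), num_perm), dtype=np.uint64)
    for i, t in enumerate(texts):
        for p in range(num_perm):
            sigs[i, p] = hash((t, p)) & 0xFFFFFFFF
    return sigs

def band_keys(sig, num_bands=4):
    # LSH banding: split one signature into num_bands bands, hash each band
    # into a single bucket key
    rows = len(sig) // num_bands
    return [hash(sig[b * rows:(b + 1) * rows].tobytes()) for b in range(num_bands)]

def fused_batch_fn(batch):
    # one map_batches stage: minhash, then banding immediately afterwards,
    # so only the (smaller) band keys leave the actor
    sigs = compute_minhash(batch["text"])
    batch["bands"] = [band_keys(s) for s in sigs]
    return batch

out = fused_batch_fn({"text": ["doc a", "doc a", "doc b"]})
# identical texts land in identical buckets for every band
assert out["bands"][0] == out["bands"][1]
```

The tradeoff mentioned above is visible in this shape: `fused_batch_fn` would run at the GPU stage's (lower) concurrency, whereas a separate banding stage could fan out across all CPUs.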
```python
    batch_format='pyarrow',
    zero_copy_batch=True,
    num_gpus=1,
    concurrency=3,
```
I artificially set this during testing, but we would want it to be configurable. I'm not sure what the best approach/config for this is.
Adding a `gpu_actor_concurrency` parameter to the `__init__` method would be fine.
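A minimal sketch of that suggestion, replacing the hard-coded `concurrency=3` with a constructor parameter. The class body and `map_batches_kwargs` helper here are illustrative, not the PR's actual implementation; only the parameter name `gpu_actor_concurrency` comes from the review.

```python
class RayBTSMinhashDeduplicator:
    """Sketch: expose the GPU actor pool size as a constructor parameter."""

    def __init__(self, *, use_cuda=False, gpu_actor_concurrency=3, **kwargs):
        # number of GPU minhash actors Ray runs in parallel;
        # defaults to the value that was previously hard-coded
        self.use_cuda = use_cuda
        self.gpu_actor_concurrency = gpu_actor_concurrency

    def map_batches_kwargs(self):
        # keyword arguments forwarded to dataset.map_batches(...)
        # for the GPU minhash stage
        return dict(
            batch_format='pyarrow',
            zero_copy_batch=True,
            num_gpus=1,
            concurrency=self.gpu_actor_concurrency,
        )

dedup = RayBTSMinhashDeduplicator(use_cuda=True, gpu_actor_concurrency=5)
assert dedup.map_batches_kwargs()["concurrency"] == 5
```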
LGTM. Further improvements will be implemented in the dev branch.
Merged commit af55d56 into modelscope:feat/ayushdg/gpu-minhash-poc
* Initial PoC PR that adds gpu minhash support for some cases (#644)
* add test util
* add cudf; update uv.lock; add dedup-ray-bts.yaml
* add param docstring; use uint64 for GPU operations
* separate configs for CPU/GPU
* fix pre-commit errors
* Lazy Remote Class Registration for GPUMinHashActor
* add utility for ray cluster resource checking
* add head_node_participates logic
* use_cuda instead of use_gpu extra param
* use Actor directly; use proper config and monitoring
* fix pre-commit issue: extra white line
* use available GPU for cuda minhash calculation
* use batch_size per available cluster GPU memory
* tune up max batch size
* remove temporary test util
* update param doc for minhash_batch_size
* remove redundant entry in .gitignore
* update cudf dependency group
* fix broken pyproject.toml merge
* update uv.lock; use tsinghua mirror instead of aliyun, in accordance with dockerfile

Signed-off-by: Ayush Dattagupta <[email protected]>
Co-authored-by: Ayush Dattagupta <[email protected]>