Add minhash deduplicator based on RAY and Redis #489

pan-x-c · 2024-11-15T07:32:11Z

As the title says

yxdyc

Nice work. Please see the few suggestions in my inline comments.

docs/Operators_ZH.md

configs/config_all.yaml

data_juicer/ops/deduplicator/ray_redis_minhash_deduplicator.py

yxdyc · 2024-11-18T06:53:31Z

data_juicer/ops/deduplicator/ray_redis_minhash_deduplicator.py

+            add_uid_column, batch_format='pyarrow').materialize()
+        dataset_with_id.map_batches(calculate_minhash,
+                                    batch_format='pyarrow').groupby(
+                                        HashKeys.minhash).aggregate(


Although we are (ideally) optimizing this via multiple distributed UnionSets, it is worth profiling this line and then co-optimizing it with the Ray team.
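For readers unfamiliar with the pattern under discussion, here is a minimal pure-Python sketch of grouping rows by a shared hash key and then merging the groups with a union-find structure. It is an illustration only, not the code in this PR: it uses neither Ray nor Redis, and `UnionFind`, `dedup_by_hash_groups`, and the toy band keys are hypothetical names chosen for the example.

```python
from collections import defaultdict


class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen items as their own root.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        # Attach the larger root under the smaller one, so the
        # representative of each set is its smallest uid.
        if ra < rb:
            self.parent[rb] = ra
        else:
            self.parent[ra] = rb


def dedup_by_hash_groups(rows):
    """rows: list of (uid, [hash_key, ...]); returns the sorted uids to keep.

    Rows that share any hash key are treated as duplicates (transitively),
    and one representative per cluster is kept.
    """
    # Step 1: the "groupby" stage, collecting uids per hash key.
    groups = defaultdict(list)
    for uid, keys in rows:
        for key in keys:
            groups[key].append(uid)

    # Step 2: the "aggregate" stage, merging all uids inside each group.
    uf = UnionFind()
    for uid, _ in rows:
        uf.find(uid)
    for uids in groups.values():
        for other in uids[1:]:
            uf.union(uids[0], other)

    # Step 3: keep one representative per connected component.
    return sorted({uf.find(uid) for uid, _ in rows})


# Toy input (hypothetical): rows 0, 1, 2 are linked through keys "b" and "c".
rows = [(0, ["a", "b"]), (1, ["b", "c"]), (2, ["c"]), (3, ["d"])]
kept = dedup_by_hash_groups(rows)
```

In a distributed setting the grouping stage implies a shuffle of every hash key across workers, which is presumably why the reviewer singles out this line for profiling.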

add ray minhash deduplicator

4933f5d

pan-x-c had a problem deploying to Testing November 15, 2024 07:32 — with GitHub Actions Failure

pan-x-c added 2 commits November 15, 2024 16:24

fix redis prefix

6b79f90

fix redis prefix

991e290

pan-x-c temporarily deployed to Testing November 15, 2024 08:28 — with GitHub Actions Inactive

yxdyc reviewed Nov 18, 2024


fix output bug

58e357f

pan-x-c had a problem deploying to Testing November 18, 2024 09:50 — with GitHub Actions Failure

pan-x-c added 2 commits November 20, 2024 14:06

fix comments

e1b76f5

fix pre-comment

2a1a1a7

pan-x-c had a problem deploying to Testing November 20, 2024 06:11 — with GitHub Actions Failure

chenyushuo mentioned this pull request Nov 28, 2024

Add minhash deduplicator based on RAY. #502

Merged

yxdyc requested a review from chenyushuo December 20, 2024 03:27

yxdyc added the labels dj:op (issues/PRs about some specific OPs), dj:dist (issues/PRs about distributed data processing), and dj:efficiency (efficiency issues and enhancements) Dec 20, 2024
