
Add converter drop duplicates experiment script #492

Merged · 2 commits · Mar 9, 2025

Conversation

Zyiqin-Miranda (Member)

Summary

This PR adds an experiment script for the converter's drop duplicates operation.
The approaches explored satisfy the requirement that, for the same primary key, a deterministic record is kept according to the following criteria (a minimal sketch of this keep logic follows the references below):

  1. Records within the same file: the record with the winning value in a customizable sort_order column is kept. This defaults to the record index, so the record with the last record index is kept.
  2. Records across files: the record with the largest file_sequence_number (Iceberg terminology) or stream_position (DeltaCAT terminology) is kept.

Results and comments are included as docstrings.
References: drop duplicates performance benchmark
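
For illustration, here is a minimal PyArrow sketch of the keep logic described above. The column names (pk_hash, record_index, file_sequence_number) and the helper itself are assumptions for illustration, not the exact approach taken in the script added by this PR.

```python
import pyarrow as pa


def drop_duplicates(table: pa.Table, pk_col: str = "pk_hash") -> pa.Table:
    # Sort so the record to keep sorts last within each primary key:
    # a larger file_sequence_number wins across files, and a larger
    # record_index wins within a file.
    table = table.sort_by(
        [("file_sequence_number", "ascending"), ("record_index", "ascending")]
    )
    # Positional index over the sorted table, used to map aggregation
    # results back to rows.
    table = table.append_column(
        "row_index", pa.array(range(len(table)), type=pa.int64())
    )
    # For each primary key, keep the row at the largest positional index,
    # i.e. the latest record under the ordering above.
    keep = table.group_by(pk_col).aggregate([("row_index", "max")])
    return table.take(keep["row_index_max"]).drop_columns(["row_index"])
```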

Checklist

  • Unit tests covering the changes have been added
    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed


@Zyiqin-Miranda force-pushed the convert-drop-duplicates-benchmark branch from d54347c to bf8b360 on March 3, 2025 19:12
@Zyiqin-Miranda requested a review from pdames on March 3, 2025 19:13
@pdames (Member) left a comment


Very cool! I'd also like to start running and formalizing some of these benchmarks, so feel free to merge when you're ready!

import secrets

# This benchmark script tests performance results for different combinations of pyarrow.compute functions used by the converter.
# The main performance factors are latency and memory consumption.
@pdames (Member)

Is latency here meant to imply just end-to-end latency (i.e., minimize time to complete over compute resources required) or also efficiency (i.e., minimize compute resources required over time to complete)?

I expect the latter to likely be higher-priority for this case, but also expect the two figures to be closely related.

@Zyiqin-Miranda (Member, Author) · Mar 4, 2025

Here, specifically for this script, latency means end-to-end latency, and memory consumption covers minimizing compute resources. I started this script just thinking about benchmarking different PyArrow compute functions, but then realized that the options for satisfying the deterministic record-keep requirement across the table are quite limited. So this script ended up being something of a POC for drop duplicates on upsert deltas.
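
As a rough illustration of how both figures could be captured per approach, here is a minimal sketch using the standard library and PyArrow's default memory pool; the benchmark helper and its usage are assumptions for illustration, not part of the script in this PR.

```python
import time

import pyarrow as pa


def benchmark(fn, *args, **kwargs):
    """Run fn once and report end-to-end latency plus Arrow memory-pool stats."""
    pool = pa.default_memory_pool()
    start_bytes = pool.bytes_allocated()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_s = time.perf_counter() - start
    # max_memory() reports the pool's lifetime peak, so run each approach in a
    # fresh process if the peaks are meant to be compared against each other.
    print(
        f"{getattr(fn, '__name__', 'fn')}: {latency_s:.3f}s, "
        f"allocated delta={pool.bytes_allocated() - start_bytes} bytes, "
        f"pool peak={pool.max_memory()} bytes"
    )
    return result
```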

@Zyiqin-Miranda (Member, Author) · Mar 4, 2025

Also, after writing out the approaches here, I feel that the difference in "efficiency" and "end-to-end latency" between these three approaches is affected by table characteristics such as the primary key duplication rate.
If a table has a high pk duplication rate, we'd want to perform file-level drop duplicates as much as possible, instead of taking in a large table and performing a global drop duplicates at the end (to avoid OOM or requiring too many resources at once).
That consideration is really about minimizing the possibility of OOM by sacrificing some end-to-end latency.
But the overall memory required shouldn't be very different in my mind, since we need to read in pretty much the same columns (pk hash columns, index columns, file sequence number columns, etc.), plus drop duplicates twice: first on the sort column within each file, then on the file sequence number / stream position order across files.
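
To make that two-pass flow concrete, here is a rough sketch of file-level drop duplicates followed by a global pass across files, assuming the same hypothetical column names as the sketch above; it illustrates the idea rather than the exact approach benchmarked in the script.

```python
from typing import Iterable

import pyarrow as pa


def drop_duplicates_two_pass(
    file_tables: Iterable[pa.Table], pk_col: str = "pk_hash"
) -> pa.Table:
    def keep_last(table: pa.Table, sort_keys) -> pa.Table:
        # Sort so the record to keep comes last per primary key, then take
        # the row with the largest positional index in each group.
        table = table.sort_by(sort_keys)
        table = table.append_column(
            "pos", pa.array(range(len(table)), type=pa.int64())
        )
        keep = table.group_by(pk_col).aggregate([("pos", "max")])
        return table.take(keep["pos_max"]).drop_columns(["pos"])

    # Pass 1: dedup within each file on the sort column, so only one candidate
    # row per primary key per file reaches the global step (bounding the size
    # of the table the global pass has to hold at once).
    per_file = [keep_last(t, [("record_index", "ascending")]) for t in file_tables]

    # Pass 2: dedup across files on file_sequence_number (stream_position in
    # DeltaCAT terms), operating on the already-reduced tables.
    merged = pa.concat_tables(per_file)
    return keep_last(merged, [("file_sequence_number", "ascending")])
```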

@pdames (Member) · Mar 4, 2025

Makes sense. These all sound like good observations to capture in comments regarding the different approaches you've tried here, as they may not be immediately apparent to others that are just looking over the various options at their disposal to drop duplicates. Going forward, as the compactor or delete converter's job run history allows it to learn more about the characteristics of a particular dataset it's maintaining, it can ideally make a more well-informed decision about the best approach to use for future job runs.

@pdames (Member) commented Mar 4, 2025

One nit regarding the file location - I think I'd like to start standardizing putting these types of developer sandbox scripts in a common location going forward. What do you think about moving this from deltacat/compute/converter/_dev/drop_duplicates_benchmark.py to deltacat/dev/_sandbox/compute/converter/drop_duplicates_benchmark.py?

We should also leave a note for other devs to know where they can put similar sandbox scripts in README-development.md (maybe under Coding Quirks). Having them all in a standard location like this should also make it easier to omit them from being accidentally packaged/deployed.

@Zyiqin-Miranda (Member, Author) commented Mar 9, 2025

Moved dev benchmark scripts/example scripts to sandbox location. Merging.

@Zyiqin-Miranda merged commit a392358 into 2.0 on Mar 9, 2025 (3 checks passed)
anshumankomawar pushed a commit that referenced this pull request Mar 11, 2025
* Add converter drop duplicates experiment script

* move dev scripts to sandbox location

---------

Co-authored-by: Zhu <[email protected]>
Co-authored-by: Miranda <[email protected]>
@Zyiqin-Miranda deleted the convert-drop-duplicates-benchmark branch on March 13, 2025 04:34