Add converter drop duplicates experiment script #492
Conversation
Very cool! I'd also like to start running and formalizing some of these benchmarks, so feel free to merge when you're ready!
import secrets

# This benchmark script is meant for testing out performance results for different combinations of pyarrow.compute functions used by the converter.
# The main performance factors are latency and memory consumption.
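For context, a measurement harness along these lines (a hedged sketch, not the script's actual code) could report both factors for any candidate drop-duplicates function; the dedupe_fn callable and its column expectations are placeholders:

    import time

    import pyarrow as pa


    def run_benchmark(dedupe_fn, table: pa.Table, label: str) -> pa.Table:
        """Report end-to-end latency and Arrow memory-pool usage for one approach."""
        pool = pa.default_memory_pool()
        allocated_before = pool.bytes_allocated()
        start = time.perf_counter()
        result = dedupe_fn(table)
        latency = time.perf_counter() - start
        print(
            f"{label}: {latency:.3f}s end-to-end, "
            f"{(pool.bytes_allocated() - allocated_before) / 1e6:.1f} MB net Arrow allocation, "
            f"peak pool usage {pool.max_memory() / 1e6:.1f} MB, "
            f"{table.num_rows} -> {result.num_rows} rows"
        )
        return result

Note that max_memory() reports the pool's peak since process start, so per-approach peaks are best compared across separate runs.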
Is latency here meant to imply just end-to-end latency (i.e., minimize time to complete over compute resources required) or also efficiency (i.e., minimize compute resources required over time to complete)?
I expect the latter to likely be higher-priority for this case, but also expect the two figures to be closely related.
Specifically for this script, latency means end-to-end latency, and memory consumption is the proxy for minimizing compute resources. I started this script just thinking about benchmarking different PyArrow compute functions, but then I realized that to satisfy the requirement of keeping a deterministic record across the table, the options are quite limited. So this script ended up being a sort of POC for drop duplicates for upsert deltas.
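To make that concrete, here is a hedged sketch (with assumed column names pk_hash and stream_position, not the script's actual code) of one way to express the deterministic-keep rule with the PyArrow Table/compute APIs, keeping the record with the greatest stream position per primary-key hash:

    import pyarrow as pa


    def drop_duplicates_keep_latest(table: pa.Table) -> pa.Table:
        # Find the winning stream_position for each pk_hash group.
        winners = table.group_by("pk_hash").aggregate([("stream_position", "max")])
        # The aggregate output names the max column "stream_position_max"; rename it
        # back so it can serve as a join key against the original table.
        winners = winners.rename_columns(
            [
                "stream_position" if name == "stream_position_max" else name
                for name in winners.column_names
            ]
        )
        # Inner join recovers the full winning rows. This assumes stream_position
        # values are unique within a pk_hash group, so exactly one row survives per key.
        return table.join(winners, keys=["pk_hash", "stream_position"], join_type="inner")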
Also, after writing out the approaches here, I feel like the difference in "efficiency" and "end-to-end latency" between these three approaches is affected by table characteristics like the primary key duplication rate.
If a table has a high pk duplication rate, we'd want to perform file-level drop duplicates as much as possible instead of taking in a large table and performing a global drop duplicates at the end (to avoid OOM or requiring too many resources at once). This kind of consideration is really choosing to minimize the possibility of OOM by sacrificing some end-to-end latency.
But the overall memory required wouldn't be very different in my mind, as we need to take in pretty much the same columns (pk hash columns, index columns, file sequence number columns, etc.), plus drop duplicates twice: first by the sort column within each file, then by file sequence number/stream position order.
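As a hedged sketch of that trade-off (all column names here are assumptions, not the converter's actual schema), the two-stage idea could look like the following: dedupe within each file first so peak memory is bounded by one file at a time, then resolve the survivors globally, preferring the higher file_sequence_number and using stream_position as a tie-breaker:

    from typing import List

    import pyarrow as pa
    import pyarrow.compute as pc


    def _keep_first_per_key(sorted_table: pa.Table, key: str) -> pa.Table:
        # Keep the first row of each key run; the input must already be sorted by key.
        n = sorted_table.num_rows
        if n == 0:
            return sorted_table
        keys = sorted_table.column(key)
        # A row survives if it is the first row overall or its key differs from the
        # previous row's key.
        changed = pc.not_equal(keys.slice(1), keys.slice(0, n - 1))
        mask = pa.chunked_array([pa.array([True])] + changed.chunks)
        return sorted_table.filter(mask)


    def dedupe_within_file(table: pa.Table) -> pa.Table:
        # File-level pass: keep the latest stream_position per pk_hash within one file.
        ordered = table.sort_by(
            [("pk_hash", "ascending"), ("stream_position", "descending")]
        )
        return _keep_first_per_key(ordered, "pk_hash")


    def dedupe_across_files(file_tables: List[pa.Table]) -> pa.Table:
        # Global pass: merge per-file survivors and keep one record per pk_hash,
        # ordered by file_sequence_number, then stream_position.
        merged = pa.concat_tables([dedupe_within_file(t) for t in file_tables])
        ordered = merged.sort_by(
            [
                ("pk_hash", "ascending"),
                ("file_sequence_number", "descending"),
                ("stream_position", "descending"),
            ]
        )
        return _keep_first_per_key(ordered, "pk_hash")

The global pass only sees the per-file survivors, so its cost scales with the number of surviving keys rather than the raw row count, which is where the pk duplication rate comes in.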
Makes sense. These all sound like good observations to capture in comments regarding the different approaches you've tried here, as they may not be immediately apparent to others that are just looking over the various options at their disposal to drop duplicates. Going forward, as the compactor or delete converter's job run history allows it to learn more about the characteristics of a particular dataset it's maintaining, it can ideally make a more well-informed decision about the best approach to use for future job runs.
One nit regarding the file location: I'd like to start standardizing on putting these types of developer sandbox scripts in a common location going forward. What do you think about moving this from its current path to that common location? We should also leave a note for other devs so they know where they can put similar sandbox scripts.
Moved the dev benchmark/example scripts to the sandbox location. Merging.
* Add converter drop duplicates experiment script
* move dev scripts to sandbox location
---------
Co-authored-by: Zhu <[email protected]>
Co-authored-by: Miranda <[email protected]>
Summary
This PR adds the experiment script for converter drop duplicates operations.
The approach explored satisfies the requirement that, for the same primary key, a deterministic record is kept based on the following criterion: which record to keep is determined by file_sequence_number in Iceberg terms, or stream_position in DeltaCAT terms.
Results/comments are added as docstrings.
References: drop duplicates performance benchmark
Checklist
Unit tests covering the changes have been added
E2E testing has been performed
Additional Notes
Any additional information or context relevant to this PR.