
Add converter drop duplicates experiment script #492

Merged · 2 commits · Mar 9, 2025

Conversation

Zyiqin-Miranda (Member)

Summary

This PR adds an experiment script for the converter's drop duplicates operation.
The approaches explored satisfy the requirement that, for the same primary key, a deterministic record is kept according to the following criteria (a minimal sketch of this keep logic follows the references below):

  1. Records within the same file: the record with the winning value in a customizable sort_order column is kept. This defaults to the record index, so the record with the last record index is kept.
  2. Records across files: the record with the largest file_sequence_number (Iceberg terminology) or stream_position (DeltaCAT terminology) is kept.

Results and comments are included as docstrings.
References: drop duplicates performance benchmark
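
For illustration, here is a minimal PyArrow sketch of the keep logic described above. The column names (pk_hash, record_index, file_sequence_number) and the helper itself are assumptions for illustration, not the exact approach taken in the script added by this PR.

```python
import pyarrow as pa


def drop_duplicates(table: pa.Table, pk_col: str = "pk_hash") -> pa.Table:
    # Sort so the record to keep sorts last within each primary key:
    # a larger file_sequence_number wins across files, and a larger
    # record_index wins within a file.
    table = table.sort_by(
        [("file_sequence_number", "ascending"), ("record_index", "ascending")]
    )
    # Positional index over the sorted table, used to map aggregation
    # results back to rows.
    table = table.append_column(
        "row_index", pa.array(range(len(table)), type=pa.int64())
    )
    # For each primary key, keep the row at the largest positional index,
    # i.e. the latest record under the ordering above.
    keep = table.group_by(pk_col).aggregate([("row_index", "max")])
    return table.take(keep["row_index_max"]).drop_columns(["row_index"])
```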

Checklist

  • Unit tests covering the changes have been added
    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed


@Zyiqin-Miranda force-pushed the convert-drop-duplicates-benchmark branch from d54347c to bf8b360 on March 3, 2025 19:12
@Zyiqin-Miranda requested a review from pdames on March 3, 2025 19:13
@pdames (Member) left a comment


Very cool! I'd also like to start running and formalizing some of these benchmarks, so feel free to merge when you're ready!

import secrets

# This benchmark script tests performance results for different combinations of pyarrow.compute functions used by the converter.
# The main performance factors are latency and memory consumption.
@pdames (Member)

Is latency here meant to imply just end-to-end latency (i.e., minimize time to complete over compute resources required) or also efficiency (i.e., minimize compute resources required over time to complete)?

I expect the latter to likely be higher-priority for this case, but also expect the two figures to be closely related.

@Zyiqin-Miranda (Member, Author) · Mar 4, 2025

Here, specifically for this script, latency means end-to-end latency, and memory consumption covers minimizing compute resources. I started this script just thinking about benchmarking different PyArrow compute functions, but then realized that the options for satisfying the deterministic record-keep requirement across the table are quite limited. So this script ended up being something of a POC for drop duplicates on upsert deltas.
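
As a rough illustration of how both figures could be captured per approach, here is a minimal sketch using the standard library and PyArrow's default memory pool; the benchmark helper and its usage are assumptions for illustration, not part of the script in this PR.

```python
import time

import pyarrow as pa


def benchmark(fn, *args, **kwargs):
    """Run fn once and report end-to-end latency plus Arrow memory-pool stats."""
    pool = pa.default_memory_pool()
    start_bytes = pool.bytes_allocated()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    latency_s = time.perf_counter() - start
    # max_memory() reports the pool's lifetime peak, so run each approach in a
    # fresh process if the peaks are meant to be compared against each other.
    print(
        f"{getattr(fn, '__name__', 'fn')}: {latency_s:.3f}s, "
        f"allocated delta={pool.bytes_allocated() - start_bytes} bytes, "
        f"pool peak={pool.max_memory()} bytes"
    )
    return result
```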

@Zyiqin-Miranda (Member, Author) · Mar 4, 2025

Also, after writing out the approaches here, I feel that the difference in "efficiency" and "end-to-end latency" between these three approaches is affected by table characteristics such as the primary key duplication rate.
If a table has a high pk duplication rate, we'd want to perform file-level drop duplicates as much as possible, instead of taking in a large table and performing a global drop duplicates at the end (to avoid OOM or requiring too many resources at once).
That consideration is really about minimizing the possibility of OOM by sacrificing some end-to-end latency.
But the overall memory required shouldn't be very different in my mind, since we need to read in pretty much the same columns (pk hash columns, index columns, file sequence number columns, etc.), plus drop duplicates twice: first on the sort column within each file, then on the file sequence number / stream position order across files.
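
To make that two-pass flow concrete, here is a rough sketch of file-level drop duplicates followed by a global pass across files, assuming the same hypothetical column names as the sketch above; it illustrates the idea rather than the exact approach benchmarked in the script.

```python
from typing import Iterable

import pyarrow as pa


def drop_duplicates_two_pass(
    file_tables: Iterable[pa.Table], pk_col: str = "pk_hash"
) -> pa.Table:
    def keep_last(table: pa.Table, sort_keys) -> pa.Table:
        # Sort so the record to keep comes last per primary key, then take
        # the row with the largest positional index in each group.
        table = table.sort_by(sort_keys)
        table = table.append_column(
            "pos", pa.array(range(len(table)), type=pa.int64())
        )
        keep = table.group_by(pk_col).aggregate([("pos", "max")])
        return table.take(keep["pos_max"]).drop_columns(["pos"])

    # Pass 1: dedup within each file on the sort column, so only one candidate
    # row per primary key per file reaches the global step (bounding the size
    # of the table the global pass has to hold at once).
    per_file = [keep_last(t, [("record_index", "ascending")]) for t in file_tables]

    # Pass 2: dedup across files on file_sequence_number (stream_position in
    # DeltaCAT terms), operating on the already-reduced tables.
    merged = pa.concat_tables(per_file)
    return keep_last(merged, [("file_sequence_number", "ascending")])
```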

@pdames (Member) · Mar 4, 2025

Makes sense. These all sound like good observations to capture in comments regarding the different approaches you've tried here, as they may not be immediately apparent to others that are just looking over the various options at their disposal to drop duplicates. Going forward, as the compactor or delete converter's job run history allows it to learn more about the characteristics of a particular dataset it's maintaining, it can ideally make a more well-informed decision about the best approach to use for future job runs.

@pdames (Member) commented Mar 4, 2025

One nit regarding the file location - I think I'd like to start standardizing putting these types of developer sandbox scripts in a common location going forward. What do you think about moving this from deltacat/compute/converter/_dev/drop_duplicates_benchmark.py to deltacat/dev/_sandbox/compute/converter/drop_duplicates_benchmark.py?

We should also leave a note for other devs to know where they can put similar sandbox scripts in README-development.md (maybe under Coding Quirks). Having them all in a standard location like this should also make it easier to omit them from being accidentally packaged/deployed.

@Zyiqin-Miranda (Member, Author) commented Mar 9, 2025

Moved dev benchmark scripts/example scripts to sandbox location. Merging.

@Zyiqin-Miranda merged commit a392358 into 2.0 on Mar 9, 2025 (3 checks passed)
anshumankomawar pushed a commit that referenced this pull request Mar 11, 2025
* Add converter drop duplicates experiment script

* move dev scripts to sandbox location

---------

Co-authored-by: Zhu <[email protected]>
Co-authored-by: Miranda <[email protected]>
@Zyiqin-Miranda deleted the convert-drop-duplicates-benchmark branch on March 13, 2025 04:34