
[Converter][Test] Add working end-to-end converter compute test with Minio S3 endpoint #494

Merged: 2 commits into 2.0 on Mar 13, 2025

Conversation

Zyiqin-Miranda (Member)

Summary

This PR adds a working end-to-end happy-case test for the converter's find-deletes compute path.
With (1) the Minio S3 endpoint substituted into the S3 file system and (2) Daft's parquet file download mocked,
we are able to exercise the compute part end-to-end.
Future development adding new features can reuse the same setup to verify correctness; a minimal sketch of the setup idea follows.
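
For readers setting up something similar, here is a rough sketch of that arrangement, assuming pyarrow's S3FileSystem and a local MinIO server with default credentials. All helper names and module paths below are illustrative, not the PR's actual fixtures:

```python
import pyarrow as pa
import pyarrow.fs as pafs


def make_minio_s3_filesystem() -> pafs.S3FileSystem:
    # Point the S3 filesystem at a local MinIO server instead of AWS.
    # These are MinIO's out-of-the-box defaults; adjust for your environment.
    return pafs.S3FileSystem(
        access_key="minioadmin",
        secret_key="minioadmin",
        endpoint_override="localhost:9000",
        scheme="http",
    )


def fake_parquet_download(*args, **kwargs) -> pa.Table:
    # Deterministic in-memory stand-in for the parquet table Daft would download;
    # patch it over the real download call with unittest.mock.patch, e.g.:
    #   mock.patch("deltacat...download_parquet", side_effect=fake_parquet_download)
    # (the patch target above is hypothetical, not the PR's real module path).
    return pa.table({"pk": ["a", "b", "a"], "value": [1, 2, 3]})
```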


Checklist

  • Unit tests covering the changes have been added
    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed

Additional Notes

  1. More test cases are pending: null primary keys and concatenation of multiple primary keys.
  2. We'd also like to add a check that uses PySpark to read the position deletes we produce and assert the same record count; see the sketch below.
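
A rough sketch of what that PySpark check could look like (the path and expected count are placeholders, not from the PR itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-position-deletes").getOrCreate()

# Position delete files are ordinary parquet with the reserved
# file_path/pos columns, so Spark can read them back directly.
deletes = spark.read.parquet("s3a://test-bucket/position-deletes/")  # placeholder path

expected_record_count = 3  # placeholder; supplied by the converter test setup
assert deletes.count() == expected_record_count
```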

@Zyiqin-Miranda force-pushed the tracking-2.0-converter-tests2 branch from 890e53f to 11ac5ad on March 5, 2025 06:23
@Zyiqin-Miranda requested a review from pdames on March 5, 2025 07:05
@@ -34,16 +34,27 @@ def convert(convert_input: ConvertInput):

```python
logger.info(f"Starting convert task index: {convert_task_index}")
data_files, equality_delete_files, position_delete_files = files_for_each_bucket[1]
# Get the string representation of the partition value out of Record[partition_value]
partition_value_str = (
    files_for_each_bucket[0].__repr__().split("[", 1)[1].split("]")[0]
)
```
Member:
The code to extract partition values looks pretty unfortunately opaque (but I understand this happens semi-regularly, esp. when trying to access details like this that Iceberg prefers to hide from external users). Maybe we can at least put these one-liners inside of appropriately named helper methods?
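
One possible shape for such a helper (a sketch only; `partition_record` stands in for `files_for_each_bucket[0]`, and the repr parsing just mirrors the existing one-liner):

```python
def _get_partition_value_str(partition_record) -> str:
    # 'Record[<partition_value>]' -> '<partition_value>': hides the
    # repr-parsing behind a descriptive name, per the review suggestion.
    return repr(partition_record).split("[", 1)[1].split("]", 1)[0]


# Usage inside convert():
# partition_value_str = _get_partition_value_str(files_for_each_bucket[0])
```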



```python
_ORDERED_RECORD_IDX_COLUMN_NAME = _get_iceberg_col_name("pos")
_ORDERED_RECORD_IDX_COLUMN_TYPE = pa.int64()
_ORDERED_RECORD_IDX_FIELD_METADATA = {b"PARQUET:field_id": "2147483545"}
```
@pdames (Member) commented on Mar 11, 2025:

Can we (1) leave a link back to the relevant part of the Iceberg spec like https://iceberg.apache.org/spec/#reserved-field-ids so that we know where this magic number came from going forward and (2) store the field ID in a global constant to reuse here and other places that need it (e.g., tests)?
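
For illustration, one way to apply both suggestions (the constant names are hypothetical; the IDs themselves come from the reserved-field-ID table in the linked spec, which assigns 2147483545 to `pos` and 2147483546 to `file_path`):

```python
# https://iceberg.apache.org/spec/#reserved-field-ids
ICEBERG_RESERVED_FIELD_ID_POS = 2147483545  # reserved ID for the `pos` column
ICEBERG_RESERVED_FIELD_ID_FILE_PATH = 2147483546  # reserved ID for `file_path`

_ORDERED_RECORD_IDX_FIELD_METADATA = {
    b"PARQUET:field_id": str(ICEBERG_RESERVED_FIELD_ID_POS)
}
_FILE_PATH_FIELD_METADATA = {
    b"PARQUET:field_id": str(ICEBERG_RESERVED_FIELD_ID_FILE_PATH)
}
```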

@@ -32,7 +35,10 @@ def append_record_idx_col(table: pa.Table, ordered_record_indices) -> pa.Table:

```python
_FILE_PATH_COLUMN_NAME = _get_iceberg_col_name("file_path")
_FILE_PATH_COLUMN_TYPE = pa.string()
_FILE_PATH_FIELD_METADATA = {b"PARQUET:field_id": "2147483546"}
```
Member:
Same comment here - can we leave a link back to Iceberg spec reserved field IDs and store the field ID in a global constant (e.g., either in this file or in constants.py)?

@Zyiqin-Miranda force-pushed the tracking-2.0-converter-tests2 branch from 11ac5ad to 071328a on March 13, 2025 19:35
@Zyiqin-Miranda force-pushed the tracking-2.0-converter-tests2 branch from 071328a to 2e0d530 on March 13, 2025 19:37
@github-actions (bot) commented:

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 2.

| Benchmark suite | Current: 2e0d530 | Previous: 11ac5ad | Ratio |
| --- | --- | --- | --- |
| `deltacat/tests/compute/test_compact_partition_incremental.py::test_compact_partition_incremental[1-incremental-pkstr-sknone-norcf_V1]` | 0.3144652253669833 iter/sec (stddev: 0) | 0.9607882346645665 iter/sec (stddev: 0) | 3.06 |

This comment was automatically generated by workflow using github-action-benchmark.

@Zyiqin-Miranda (Member, Author):

This PR doesn't touch compactor code, so the performance alert is unrelated. All comments addressed; merging.

@Zyiqin-Miranda Zyiqin-Miranda merged commit bf9076c into 2.0 Mar 13, 2025
3 checks passed
@Zyiqin-Miranda Zyiqin-Miranda deleted the tracking-2.0-converter-tests2 branch March 13, 2025 20:20