
[Converter][Test] Add working end-to-end converter compute test with Minio S3 endpoint #494

Merged: 2 commits into 2.0 on Mar 13, 2025

Conversation

Zyiqin-Miranda (Member)

Summary

This PR adds a working end-to-end happy-case test for the converter's find-deletes compute path.
With (1) the Minio S3 endpoint substituted into the S3 file system and (2) Daft's parquet file download mocked,
we are able to exercise the compute part end-to-end.
Future development adding new features can reuse the same setup to verify correctness; a minimal sketch of the setup idea follows.
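
For readers setting up something similar, here is a rough sketch of that arrangement, assuming pyarrow's S3FileSystem and a local MinIO server with default credentials. All helper names and module paths below are illustrative, not the PR's actual fixtures:

```python
import pyarrow as pa
import pyarrow.fs as pafs


def make_minio_s3_filesystem() -> pafs.S3FileSystem:
    # Point the S3 filesystem at a local MinIO server instead of AWS.
    # These are MinIO's out-of-the-box defaults; adjust for your environment.
    return pafs.S3FileSystem(
        access_key="minioadmin",
        secret_key="minioadmin",
        endpoint_override="localhost:9000",
        scheme="http",
    )


def fake_parquet_download(*args, **kwargs) -> pa.Table:
    # Deterministic in-memory stand-in for the parquet table Daft would download;
    # patch it over the real download call with unittest.mock.patch, e.g.:
    #   mock.patch("deltacat...download_parquet", side_effect=fake_parquet_download)
    # (the patch target above is hypothetical, not the PR's real module path).
    return pa.table({"pk": ["a", "b", "a"], "value": [1, 2, 3]})
```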


Checklist

  • Unit tests covering the changes have been added
    • If this is a bugfix, regression tests have been added
  • E2E testing has been performed

Additional Notes

  1. More test cases are pending: null primary keys and concatenation of multiple primary keys.
  2. We'd also like to add a check that uses PySpark to read the position deletes we produce and assert the same record count; see the sketch below.
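
A rough sketch of what that PySpark check could look like (the path and expected count are placeholders, not from the PR itself):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("verify-position-deletes").getOrCreate()

# Position delete files are ordinary parquet with the reserved
# file_path/pos columns, so Spark can read them back directly.
deletes = spark.read.parquet("s3a://test-bucket/position-deletes/")  # placeholder path

expected_record_count = 3  # placeholder; supplied by the converter test setup
assert deletes.count() == expected_record_count
```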

@Zyiqin-Miranda force-pushed the tracking-2.0-converter-tests2 branch from 890e53f to 11ac5ad on March 5, 2025 06:23
@Zyiqin-Miranda requested a review from pdames on March 5, 2025 07:05
@@ -34,16 +34,27 @@ def convert(convert_input: ConvertInput):

```python
logger.info(f"Starting convert task index: {convert_task_index}")
data_files, equality_delete_files, position_delete_files = files_for_each_bucket[1]
# Get the string representation of the partition value out of Record[partition_value]
partition_value_str = (
    files_for_each_bucket[0].__repr__().split("[", 1)[1].split("]")[0]
)
```
Member:
The code to extract partition values looks pretty unfortunately opaque (but I understand this happens semi-regularly, esp. when trying to access details like this that Iceberg prefers to hide from external users). Maybe we can at least put these one-liners inside of appropriately named helper methods?
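
One possible shape for such a helper (a sketch only; `partition_record` stands in for `files_for_each_bucket[0]`, and the repr parsing just mirrors the existing one-liner):

```python
def _get_partition_value_str(partition_record) -> str:
    # 'Record[<partition_value>]' -> '<partition_value>': hides the
    # repr-parsing behind a descriptive name, per the review suggestion.
    return repr(partition_record).split("[", 1)[1].split("]", 1)[0]


# Usage inside convert():
# partition_value_str = _get_partition_value_str(files_for_each_bucket[0])
```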



```python
_ORDERED_RECORD_IDX_COLUMN_NAME = _get_iceberg_col_name("pos")
_ORDERED_RECORD_IDX_COLUMN_TYPE = pa.int64()
_ORDERED_RECORD_IDX_FIELD_METADATA = {b"PARQUET:field_id": "2147483545"}
```
@pdames (Member) commented on Mar 11, 2025:

Can we (1) leave a link back to the relevant part of the Iceberg spec like https://iceberg.apache.org/spec/#reserved-field-ids so that we know where this magic number came from going forward and (2) store the field ID in a global constant to reuse here and other places that need it (e.g., tests)?
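
For illustration, one way to apply both suggestions (the constant names are hypothetical; the IDs themselves come from the reserved-field-ID table in the linked spec, which assigns 2147483545 to `pos` and 2147483546 to `file_path`):

```python
# https://iceberg.apache.org/spec/#reserved-field-ids
ICEBERG_RESERVED_FIELD_ID_POS = 2147483545  # reserved ID for the `pos` column
ICEBERG_RESERVED_FIELD_ID_FILE_PATH = 2147483546  # reserved ID for `file_path`

_ORDERED_RECORD_IDX_FIELD_METADATA = {
    b"PARQUET:field_id": str(ICEBERG_RESERVED_FIELD_ID_POS)
}
_FILE_PATH_FIELD_METADATA = {
    b"PARQUET:field_id": str(ICEBERG_RESERVED_FIELD_ID_FILE_PATH)
}
```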

@@ -32,7 +35,10 @@ def append_record_idx_col(table: pa.Table, ordered_record_indices) -> pa.Table:

```python
_FILE_PATH_COLUMN_NAME = _get_iceberg_col_name("file_path")
_FILE_PATH_COLUMN_TYPE = pa.string()
_FILE_PATH_FIELD_METADATA = {b"PARQUET:field_id": "2147483546"}
```
Member:
Same comment here - can we leave a link back to Iceberg spec reserved field IDs and store the field ID in a global constant (e.g., either in this file or in constants.py)?

@Zyiqin-Miranda force-pushed the tracking-2.0-converter-tests2 branch from 11ac5ad to 071328a on March 13, 2025 19:35
@Zyiqin-Miranda force-pushed the tracking-2.0-converter-tests2 branch from 071328a to 2e0d530 on March 13, 2025 19:37
@github-actions (bot) commented:

⚠️ Performance Alert ⚠️

Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 2.

| Benchmark suite | Current: 2e0d530 | Previous: 11ac5ad | Ratio |
| --- | --- | --- | --- |
| `deltacat/tests/compute/test_compact_partition_incremental.py::test_compact_partition_incremental[1-incremental-pkstr-sknone-norcf_V1]` | 0.3144652253669833 iter/sec (stddev: 0) | 0.9607882346645665 iter/sec (stddev: 0) | 3.06 |

This comment was automatically generated by workflow using github-action-benchmark.

@Zyiqin-Miranda (Member, Author):

This PR doesn't touch compactor code, so the performance alert is unrelated. All comments addressed; merging.

@Zyiqin-Miranda Zyiqin-Miranda merged commit bf9076c into 2.0 Mar 13, 2025
3 checks passed
@Zyiqin-Miranda Zyiqin-Miranda deleted the tracking-2.0-converter-tests2 branch March 13, 2025 20:20