[Converter][Test]Add working end-to-end converter compute test with Minio S3 endpoint #494
Conversation
Force-pushed from 890e53f to 11ac5ad
@@ -34,16 +34,27 @@ def convert(convert_input: ConvertInput):

    logger.info(f"Starting convert task index: {convert_task_index}")
    data_files, equality_delete_files, position_delete_files = files_for_each_bucket[1]
    # Get string representation of partition value out of Record[partition_value]
    partition_value_str = (
        files_for_each_bucket[0].__repr__().split("[", 1)[1].split("]")[0]
    )
The code to extract partition values looks pretty unfortunately opaque (but I understand this happens semi-regularly, esp. when trying to access details like this that Iceberg prefers to hide from external users). Maybe we can at least put these one-liners inside of appropriately named helper methods?
_ORDERED_RECORD_IDX_COLUMN_NAME = _get_iceberg_col_name("pos")
_ORDERED_RECORD_IDX_COLUMN_TYPE = pa.int64()
_ORDERED_RECORD_IDX_FIELD_METADATA = {b"PARQUET:field_id": "2147483545"}
Can we (1) leave a link back to the relevant part of the Iceberg spec like https://iceberg.apache.org/spec/#reserved-field-ids so that we know where this magic number came from going forward and (2) store the field ID in a global constant to reuse here and other places that need it (e.g., tests)?
@@ -32,7 +35,10 @@ def append_record_idx_col(table: pa.Table, ordered_record_indices) -> pa.Table:

_FILE_PATH_COLUMN_NAME = _get_iceberg_col_name("file_path")
_FILE_PATH_COLUMN_TYPE = pa.string()
_FILE_PATH_FIELD_METADATA = {b"PARQUET:field_id": "2147483546"}
Same comment here: can we leave a link back to the Iceberg spec's reserved field IDs and store the field ID in a global constant (e.g., either in this file or in constants.py)?
Force-pushed from 11ac5ad to 071328a
Force-pushed from 071328a to 2e0d530
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark.
Benchmark result of this commit is worse than the previous benchmark result, exceeding threshold 2.

Benchmark suite | Current: 2e0d530 | Previous: 11ac5ad | Ratio
---|---|---|---
deltacat/tests/compute/test_compact_partition_incremental.py::test_compact_partition_incremental[1-incremental-pkstr-sknone-norcf_V1] | 0.3144652253669833 iter/sec (stddev: 0) | 0.9607882346645665 iter/sec (stddev: 0) | 3.06
This comment was automatically generated by workflow using github-action-benchmark.
This PR does not touch compactor code, so the performance alert is unrelated. Addressed all comments, merging.
Summary
This PR adds a working end-to-end happy-case test for the converter's find-deletes compute path.
With (1) a Minio S3 endpoint substituted into the S3 file system and (2) Daft's parquet file download mocked,
we are able to test the compute part end-to-end.
Future development adding new features can reuse the same setup to verify correctness.
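The endpoint-substitution idea from the summary can be sketched as follows. This is an illustration of the pattern, not code from the PR: the function name is hypothetical, and the endpoint and credentials are MinIO's stock local defaults rather than values used by this test.

```python
def minio_s3_kwargs(endpoint: str = "http://localhost:9000") -> dict:
    """Connection kwargs that redirect an S3 client to a local MinIO server.

    Works with any client that accepts these kwargs (e.g. boto3); the same
    idea applies to pyarrow's S3FileSystem via its endpoint_override option.
    """
    return {
        "endpoint_url": endpoint,             # route requests to MinIO, not AWS
        "aws_access_key_id": "minioadmin",    # MinIO default credentials
        "aws_secret_access_key": "minioadmin",
    }
```

With the S3 file system built from these kwargs, the converter's compute path reads and writes real objects end-to-end without touching AWS.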
Rationale
Explain the reasoning behind the changes and their benefits to the project.
Changes
List the major changes made in this pull request.
Impact
Discuss any potential impacts the changes may have on existing functionalities.
Testing
Describe how the changes have been tested, including both automated and manual testing strategies.
If this is a bugfix, explain how the fix has been tested to ensure the bug is resolved without introducing new issues.
Regression Risk
If this is a bugfix, assess the risk of regression caused by this fix and steps taken to mitigate it.
Checklist
Unit tests covering the changes have been added
E2E testing has been performed
Additional Notes