-
Couldn't load subscription status.
- Fork 208
Python connector CDF reads kernel integration #613
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python connector CDF reads kernel integration #613
Conversation
eda025a to
75835d3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Most pressing issue is updating to the main branch.
| use std::sync::Arc; | ||
|
|
||
| use arrow::compute::filter_record_batch; | ||
| use arrow::datatypes::SchemaRef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: Both arrow and kernel have their own SchemaRef. This is up to personal preference, but you may want to give this an alias so that the type of SchemaRef doesn't become confusing.
| use arrow::datatypes::SchemaRef; | |
| use arrow::datatypes::SchemaRef as ArrowSchemaRef; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You may also want to do the same for TableChangesScan:
use delta_kernel::table_changes::scan::TableChangesScan as KernelTableChangesScanThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think ArrowSchemaRef makes sense, but TableChangesScan is only used once so I don't think using an alias for it is as beneficial.
| ending_version=ending_version, | ||
| starting_timestamp=starting_timestamp, | ||
| ending_timestamp=ending_timestamp, | ||
| include_historical_metadata=use_delta_format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm? why include_historical_metadata=use_delta_format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see why, let's add a comment to this line about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
| # os.utime accepts seconds while delta log timestamp is in ms | ||
| os.utime(log_file_path, times=(0, version_to_timestamp[version] // 1000)) | ||
|
|
||
| if min_version > 0 and num_versions_with_action > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what case will num_versions_with_action be 0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One case could be if only metadata changes occurred during the queried version range.
| protocol_json = json.loads(next(lines)) | ||
| metadata_json = json.loads(next(lines)) | ||
| actions: List[FileAction] = [] | ||
| for line in lines: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and let's try some logs for the ease of debugging in the future when issues happen.
Such as printing out # of lines received from the server.
6cb527d to
e72d39f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python code changes look good!
Please get approval for the rust code before merging!
And could you also check the failure in the wheel build jobs and see if it's possible to fix them?
| ending_version=ending_version, | ||
| starting_timestamp=starting_timestamp, | ||
| ending_timestamp=ending_timestamp, | ||
| include_historical_metadata=use_delta_format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good!
22b77cc to
4032dc6
Compare
4032dc6 to
4a9f4db
Compare
* fix * upgrade delta-kernel-rs (#607) * update expected error msg in test (#608) * rust bindings for CDF (#612) Co-authored-by: Oussama Saoudi <[email protected]> * python connector CDF reads (#613) * fix version (#616) * reference rust wrapper 0.2.0 (#617) * add table changes with use_delta_format=True example (#619) * Fix wheels build by removing duplicate wheels (#623) * update delta-kernel-rust-sharing-wrapper to 0.2.1 and fix path in workflows (#625) * update delta-kernel-rust-sharing-wrapper to 0.2.1 * fix names in build workflow * Improvement python delta-sharing client: convert expires_in as string to int, if returned as string (#628) **TL;DR:** This PR enhances the OAuth client to support cases where the expires_in field in the token response is returned as a string instead of an integer. While the OAuth 2.0 specification mandates that expires_in should be an integer [RFC 6749 Section 4.1.4](https://datatracker.ietf.org/doc/html/rfc6749#section-4.1.4), some OAuth servers return it as a string, leading to potential compatibility issues. Certain OAuth implementations deviate from the standard and return expires_in as a string, e.g.: ``` { "access_token": "example-token", "expires_in": "3600", // Returned as a string "token_type": "Bearer" } ``` This causes failures when the client expects the field to always be an integer. Solution This PR updates the token parsing logic to: 1. Check the type of the expires_in field. 2. Convert the value to an integer if it is provided as a string. 3. Maintain backward compatibility with the standard integer format. * upgrade delta-kernel-rs to 0.6.1 to fix reading tables partitioned on timestamp (#634) * include source in maturin build of delta-kernel-rust-sharing-wrapper (#647) * update latest version that does not require the rust wrapper in README (#650) * Support delta format for table protocol and metadata (#657) * timestampntz support (#654) * Fix timestampntz test (#662) * fix python lint and reformat scripts (#668) * ubuntu version * fix --------- Co-authored-by: Patrick Jin <[email protected]> Co-authored-by: Oussama Saoudi <[email protected]> Co-authored-by: Moe Derakhshani <[email protected]>
Main differences compared to the prototype #609 include:
Regarding performance, most of the time is spent in kernel, and writing the delta log is negligible in comparison.
