Skip to content

Conversation

@PatrickJin-db
Copy link
Collaborator

@PatrickJin-db PatrickJin-db commented Dec 11, 2024

Main differences compared to the prototype #609 include:

  • Not using a prototype branch of delta-kernel-rs
  • Better code organization
  • Ensuring that files are closed and tempfiles are cleaned up even upon failure
  • Fixed some of the tests

Regarding performance, most of the time is spent in kernel, and writing the delta log is negligible in comparison.
image

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/cdf_python_changes branch 7 times, most recently from eda025a to 75835d3 Compare December 11, 2024 18:01
Copy link
Collaborator

@OussamaSaoudi-db OussamaSaoudi-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most pressing issue is updating to the main branch.

use std::sync::Arc;

use arrow::compute::filter_record_batch;
use arrow::datatypes::SchemaRef;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: Both arrow and kernel have their own SchemaRef. This is up to personal preference, but you may want to give this an alias so that the type of SchemaRef doesn't become confusing.

Suggested change
use arrow::datatypes::SchemaRef;
use arrow::datatypes::SchemaRef as ArrowSchemaRef;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may also want to do the same for TableChangesScan:

use delta_kernel::table_changes::scan::TableChangesScan as KernelTableChangesScan

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think ArrowSchemaRef makes sense, but TableChangesScan is only used once so I don't think using an alias for it is as beneficial.

ending_version=ending_version,
starting_timestamp=starting_timestamp,
ending_timestamp=ending_timestamp,
include_historical_metadata=use_delta_format
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmmm? why include_historical_metadata=use_delta_format?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see why, let's add a comment to this line about it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done, PTAL

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

# os.utime accepts seconds while delta log timestamp is in ms
os.utime(log_file_path, times=(0, version_to_timestamp[version] // 1000))

if min_version > 0 and num_versions_with_action > 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In what case will num_versions_with_action be 0?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One case could be if only metadata changes occurred during the queried version range.

protocol_json = json.loads(next(lines))
metadata_json = json.loads(next(lines))
actions: List[FileAction] = []
for line in lines:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and let's try some logs for the ease of debugging in the future when issues happen.
Such as printing out # of lines received from the server.

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/cdf_python_changes branch 8 times, most recently from 6cb527d to e72d39f Compare December 12, 2024 07:31
Copy link
Collaborator

@linzhou-db linzhou-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The python code changes look good!
Please get approval for the rust code before merging!
And could you also check the failure in the wheel build jobs and see if it's possible to fix them?

ending_version=ending_version,
starting_timestamp=starting_timestamp,
ending_timestamp=ending_timestamp,
include_historical_metadata=use_delta_format
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good!

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/cdf_python_changes branch 8 times, most recently from 22b77cc to 4032dc6 Compare December 18, 2024 01:46
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/cdf_python_changes branch from 4032dc6 to 4a9f4db Compare December 18, 2024 02:42
@PatrickJin-db PatrickJin-db merged commit 73dea97 into delta-io:main Dec 18, 2024
5 checks passed
@PatrickJin-db PatrickJin-db mentioned this pull request Dec 18, 2024
charlenelyu-db pushed a commit to charlenelyu-db/delta-sharing that referenced this pull request Mar 25, 2025
charlenelyu-db added a commit that referenced this pull request Mar 27, 2025
* fix

* upgrade delta-kernel-rs (#607)

* update expected error msg in test (#608)

* rust bindings for CDF (#612)

Co-authored-by: Oussama Saoudi <[email protected]>

* python connector CDF reads (#613)

* fix version (#616)

* reference rust wrapper 0.2.0 (#617)

* add table changes with use_delta_format=True example (#619)

* Fix wheels build by removing duplicate wheels (#623)

* update delta-kernel-rust-sharing-wrapper to 0.2.1 and fix path in workflows (#625)

* update delta-kernel-rust-sharing-wrapper to 0.2.1

* fix names in build workflow

* Improvement python delta-sharing client: convert expires_in as string to int, if returned as string (#628)

**TL;DR:** This PR enhances the OAuth client to support cases where the expires_in field in the token response is returned as a string instead of an integer. While the OAuth 2.0 specification mandates that expires_in should be an integer  [RFC 6749 Section 4.1.4](https://datatracker.ietf.org/doc/html/rfc6749#section-4.1.4), some OAuth servers return it as a string, leading to potential compatibility issues.

Certain OAuth implementations deviate from the standard and return expires_in as a string, e.g.:

```
{
  "access_token": "example-token",
  "expires_in": "3600",  // Returned as a string
  "token_type": "Bearer"
}
```
This causes failures when the client expects the field to always be an integer.

Solution

This PR updates the token parsing logic to:
	1.	Check the type of the expires_in field.
	2.	Convert the value to an integer if it is provided as a string.
	3.	Maintain backward compatibility with the standard integer format.

* upgrade delta-kernel-rs to 0.6.1 to fix reading tables partitioned on timestamp (#634)

* include source in maturin build of delta-kernel-rust-sharing-wrapper (#647)

* update latest version that does not require the rust wrapper in README (#650)

* Support delta format for table protocol and metadata (#657)

* timestampntz support (#654)

* Fix timestampntz test (#662)

* fix python lint and reformat scripts (#668)

* ubuntu version

* fix

---------

Co-authored-by: Patrick Jin <[email protected]>
Co-authored-by: Oussama Saoudi <[email protected]>
Co-authored-by: Moe Derakhshani <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants