Add DeltaLake source support #31

Open: 4 commits to merge into base: main

danielbeach

Adds a new method called read_deltalake(). It reads the Delta Lake table and gets the list of Parquet files for that table so we can reuse ParquetDataSet. This is done by calling dt.files(), which returns a list of Parquet files.
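
A minimal sketch of the approach described above, assuming the `deltalake` Python package is installed. Only `DeltaTable` and its `files()` method come from the description; the helper name `delta_parquet_paths`, its optional `table` argument, and the joining of paths onto the table root are illustrative assumptions, not the code in this PR.

```python
import posixpath
from typing import Any, List, Optional


def delta_parquet_paths(table_uri: str, table: Optional[Any] = None) -> List[str]:
    """Return the Parquet files backing the loaded version of a Delta table.

    `table` is a hypothetical hook for passing an already opened DeltaTable
    (or any stand-in with a files() method); by default one is opened here.
    """
    if table is None:
        # Third-party dependency, imported lazily: pip install deltalake
        from deltalake import DeltaTable

        table = DeltaTable(table_uri)

    # Assumption: files() returns paths relative to the table root, so they
    # are joined onto the root before being handed to a Parquet reader.
    root = table_uri.rstrip("/")
    return [posixpath.join(root, f) for f in table.files()]
```

The resulting list could then be passed to the existing Parquet reading path, which is what would allow ParquetDataSet to be reused.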

@wangrunji0408
Collaborator

As far as I know, Delta Lake is not simply a list of parquet files, right? I think we need a dedicated DeltaLakeDataset to handle them properly.
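
For context on this concern: a Delta table is a directory of Parquet files plus a transaction log under `_delta_log/`, and which files are live depends on replaying the add and remove actions in that log. The following is a simplified sketch using only the standard library. It ignores checkpoints, deletion vectors, path encoding, and protocol checks, and the function name `live_files` is hypothetical.

```python
import json
from pathlib import Path
from typing import Set


def live_files(table_root: str) -> Set[str]:
    """Replay JSON commits in _delta_log and return the files still active."""
    active: Set[str] = set()
    log_dir = Path(table_root) / "_delta_log"
    # Commit files are named with zero-padded version numbers,
    # so sorting by name replays them in commit order.
    for commit in sorted(log_dir.glob("*.json")):
        for line in commit.read_text().splitlines():
            if not line.strip():
                continue
            action = json.loads(line)
            if "add" in action:
                active.add(action["add"]["path"])
            elif "remove" in action:
                active.discard(action["remove"]["path"])
    return active
```

A file list is therefore specific to one table version: the directory can still contain Parquet files that a later commit has logically removed.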

@danielbeach
Author

> As far as I know, Delta Lake is not simply a list of parquet files, right? I think we need a dedicated DeltaLakeDataset to handle them properly.

Yes, but we are using the deltalake package: a DeltaTable instance gives us, via the files() method it provides, the list of Parquet files that make up the current version of the Delta Lake table. In theory we could add another dependency, such as polars or daft, to return the Delta Lake table as a DataFrame and then dump it to an Arrow table, but that seems more indirect.

2 participants