Description
Environment
Delta-rs version: python-0.15.2 up until python-0.17.3
Binding: python
Bug
deltalake versions higher than v0.15.1 return an empty dataframe for get_add_actions after a CREATE OR REPLACE TABLE AS SELECT commit on our delta table.
What happened:
> delta_table = DeltaTable(input_path)
> partition_columns = delta_table.metadata().partition_columns
> partition_columns
This correctly shows the columns we partitioned on
> add_actions = delta_table.get_add_actions(flatten=True).to_pandas()
> add_actions.empty
True
The columns of the dataframe are present and correct.
What you expected to happen:
> delta_table = DeltaTable(input_path)
> add_actions = delta_table.get_add_actions(flatten=True).to_pandas()
> add_actions.empty
False
(and to have records of course ;)
More details:
We are interested in the values of the partition columns. Since deltalake does not add the partition columns when reading a DeltaTable, we were using the get_add_actions method to join the partition values onto the table using the _file_uri.
(side note: if anyone knows of a better way to do this, I would be happy to hear).
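A minimal sketch of the kind of lookup described above. It assumes the flattened add actions expose a `path` column plus one `partition.<column>` column per partition column, and that they have been converted to row dicts (for example with `to_pylist()`). The helper name `partition_values_by_path` and the sample rows are made up for illustration.

```python
from typing import Any


def partition_values_by_path(
    add_actions: list[dict[str, Any]],
    prefix: str = "partition.",
) -> dict[str, dict[str, Any]]:
    """Map each data file path to its partition values.

    `add_actions` is assumed to be a list of row dicts, e.g. from
    delta_table.get_add_actions(flatten=True).to_pylist(), where partition
    values appear under keys such as "partition.year".
    """
    lookup: dict[str, dict[str, Any]] = {}
    for row in add_actions:
        # Keep only the partition columns and strip the "partition." prefix.
        lookup[row["path"]] = {
            key[len(prefix):]: value
            for key, value in row.items()
            if key.startswith(prefix)
        }
    return lookup


# Hypothetical rows shaped like flattened add actions (not real table data).
rows = [
    {"path": "year=2023/part-0.parquet", "num_records": 3, "partition.year": "2023"},
    {"path": "year=2024/part-0.parquet", "num_records": 5, "partition.year": "2024"},
]
lookup = partition_values_by_path(rows)
print(lookup["year=2024/part-0.parquet"])  # {'year': '2024'}
```

When get_add_actions returns no rows, as in this report, the lookup is empty and there is nothing to join onto the data.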
Initially, we found no issues switching to v0.17.3. However, the values for the partition columns were suddenly missing. We looked at the table history (see below) and found that it broke after an ETL job caused a CREATE OR REPLACE TABLE AS SELECT.
We found that with v0.17.3 it still works for version 217 of our table but is broken from version 218 onwards.
We tried all versions in between to pinpoint which release broke it, and it appears to be broken from v0.15.2 onwards.
v0.15.1 continues to work for our delta table after this commit and does not display this behavior.
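A per-version check like the one described above can be sketched as follows. This is only an illustration: it assumes the table can be loaded at a given version with `DeltaTable(path, version=...)` and that the result of `get_add_actions` has a `num_rows` attribute. The names `add_action_counts` and `load_table` are hypothetical; the loader is passed in so the helper itself does not depend on a real table.

```python
from typing import Any, Callable, Iterable


def add_action_counts(
    load_table: Callable[[int], Any],
    versions: Iterable[int],
) -> dict[int, int]:
    """Return {version: number of add actions} for each table version."""
    counts: dict[int, int] = {}
    for version in versions:
        table = load_table(version)
        # A count of 0 for a non-empty table indicates the behavior reported here.
        counts[version] = table.get_add_actions(flatten=True).num_rows
    return counts


# Against a real table this would be called roughly as (not run here):
#   from deltalake import DeltaTable
#   add_action_counts(lambda v: DeltaTable(input_path, version=v), range(217, 222))
```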
version | timestamp | operation | operationParameters |
---|---|---|---|
221 | 2024-05-07T23:22:57.000+00:00 | WRITE | {"mode":"Append","statsOnLoad":"false","partitionBy":"[]"} |
220 | 2024-05-07T23:21:29.000+00:00 | DELETE | {"predicate":"["(sales_date#298771 >= 2024-04-24)"]"} |
219 | 2024-05-07T18:00:06.000+00:00 | CREATE OR REPLACE TABLE AS SELECT | {"partitionBy":"["item_in_promo","year","item"]","description":null,"isManaged":"true","properties":"{"delta.checkpoint.writeStatsAsStruct":"true","delta.checkpoint.writeStatsAsJson":"false"}","statsOnLoad":"false"} |
218 | 2024-05-07T08:41:21.000+00:00 | CREATE OR REPLACE TABLE AS SELECT | {"partitionBy":"["item_in_promo","year","item"]","description":null,"isManaged":"true","properties":"{"delta.checkpoint.writeStatsAsStruct":"true","delta.checkpoint.writeStatsAsJson":"false"}","statsOnLoad":"false"} |
217 | 2024-05-06T23:39:44.000+00:00 | WRITE | {"mode":"Append","statsOnLoad":"false","partitionBy":"[]"} |
We are not familiar enough with the codebase and Rust to determine whether this is by design or broken.
How to reproduce it:
We were unable to. We tried the following on a small example:
CREATE OR REPLACE TABLE tmp.delta_table
USING DELTA
PARTITIONED BY (partition_col)
TBLPROPERTIES (
delta.checkpoint.writeStatsAsJson=false,
delta.checkpoint.writeStatsAsStruct=true,
delta.minReaderVersion=1,
delta.minWriterVersion=2
)
AS
SELECT 1 AS partition_col, 'value' AS something;
INSERT INTO tmp.delta_table
VALUES
(1, 'value1'),
(2, 'value2'),
(3, 'value3');
INSERT INTO tmp.delta_table
VALUES
(4, 'value4'),
(5, 'value5'),
(6, 'value6');
Reading works fine at this point. Then we run the potentially troublesome operation:
CREATE OR REPLACE TABLE tmp.delta_table
USING DELTA
PARTITIONED BY (partition_col)
TBLPROPERTIES (delta.checkpoint.writeStatsAsJson=false,delta.checkpoint.writeStatsAsStruct=true)
AS SELECT 42 AS partition_col, 'value42' AS something
But afterwards, we are still able to read using v0.17.3, so this small example does not reproduce the problem.