
get_add_actions does not return any records #2507

Open
@antonsteenvoorden

Description


Environment

Delta-rs version: python-0.15.2 up until python-0.17.3

Binding: python


Bug

deltalake versions higher than v0.15.1 return an empty dataframe for get_add_actions after a CREATE OR REPLACE TABLE AS SELECT commit on our delta table.

What happened:

> delta_table = DeltaTable(input_path)
> partition_columns = delta_table.metadata().partition_columns
> partition_columns

This correctly shows the columns we partitioned on

> add_actions = delta_table.get_add_actions(flatten=True).to_pandas()
> add_actions.empty
True

The columns of the dataframe are present and correct.

What you expected to happen:

> delta_table = DeltaTable(input_path)
> add_actions = delta_table.get_add_actions(flatten=True).to_pandas()
> add_actions.empty
False

(and to have records of course ;)

More details:
We are interested in the values of the partition columns. Since deltalake does not add the partition columns when reading a DeltaTable, we were using the get_add_actions method to join the partition values onto the table using the _file_uri.
(Side note: if anyone knows of a better way to do this, I would be happy to hear it.)
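For readers unfamiliar with the join described above, here is a minimal sketch of the idea in plain Python. It assumes the add actions are available as a list of dicts with a relative `path` key and `partition.<column>` keys (roughly what `delta_table.get_add_actions(flatten=True).to_pylist()` would produce), and that each data row carries an absolute `_file_uri`. The helper names and the `table_root` parameter are hypothetical and not part of deltalake.

```python
def build_partition_lookup(add_actions):
    """Map each data file's relative path to its partition values.

    Assumption: every action dict has a "path" key and zero or more
    "partition.<column>" keys, as in the flattened add-actions layout.
    """
    prefix = "partition."
    lookup = {}
    for action in add_actions:
        values = {
            key[len(prefix):]: value
            for key, value in action.items()
            if key.startswith(prefix)
        }
        lookup[action["path"]] = values
    return lookup


def attach_partition_values(rows, lookup, table_root):
    """Return copies of `rows` with partition values merged in.

    `table_root` (hypothetical parameter) is stripped from each row's
    "_file_uri" so it can be matched against the relative paths in `lookup`.
    Rows whose file is not in `lookup` are returned unchanged.
    """
    root = table_root.rstrip("/") + "/"
    enriched = []
    for row in rows:
        uri = row["_file_uri"]
        relative = uri[len(root):] if uri.startswith(root) else uri
        enriched.append({**row, **lookup.get(relative, {})})
    return enriched
```

With an empty add-actions result the lookup is empty, so every row comes back without partition values, which matches the symptom reported here.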

Initially, we found no issues switching to v0.17.3. However, the values for the partition columns were suddenly missing. We looked at the table history (see below) and found that it broke after an ETL job caused a CREATE OR REPLACE TABLE AS SELECT.

We found that with v0.17.3 it still works for version 217 of our table but is broken from version 218 onwards.
We tried all releases in between to pinpoint which one broke it, and it appears to be broken in every release >= v0.15.2.

v0.15.1 continues to work for our delta table after this commit and does not display this behavior.
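The version-by-version check described above could be scripted roughly as follows. This is a sketch, not the procedure we actually ran: `count_add_actions` and `first_empty_version` are hypothetical helper names, and `deltalake` is imported lazily so the scanning helper can be exercised without it. It relies on `DeltaTable(table_uri, version=...)` to load a specific table version.

```python
def count_add_actions(table_uri, version):
    """Number of add actions deltalake reports for one table version."""
    # Imported lazily; requires the deltalake package to be installed.
    from deltalake import DeltaTable

    table = DeltaTable(table_uri, version=version)
    return table.get_add_actions(flatten=True).num_rows


def first_empty_version(count_for_version, versions):
    """Return the lowest version whose add-action count is zero, else None.

    `count_for_version` is any callable taking a version number, for
    example a lambda wrapping count_add_actions for a fixed table path.
    """
    for version in sorted(versions):
        if count_for_version(version) == 0:
            return version
    return None
```

For example, `first_empty_version(lambda v: count_add_actions(input_path, v), range(215, 222))` would scan the versions around the suspect commit. A table that is genuinely empty at some version would also report zero, so the result needs a sanity check against the table history.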

| version | timestamp | operation | operationParameters |
| --- | --- | --- | --- |
| 221 | 2024-05-07T23:22:57.000+00:00 | WRITE | {"mode":"Append","statsOnLoad":"false","partitionBy":"[]"} |
| 220 | 2024-05-07T23:21:29.000+00:00 | DELETE | {"predicate":"["(sales_date#298771 >= 2024-04-24)"]"} |
| 219 | 2024-05-07T18:00:06.000+00:00 | CREATE OR REPLACE TABLE AS SELECT | {"partitionBy":"["item_in_promo","year","item"]","description":null,"isManaged":"true","properties":"{"delta.checkpoint.writeStatsAsStruct":"true","delta.checkpoint.writeStatsAsJson":"false"}","statsOnLoad":"false"} |
| 218 | 2024-05-07T08:41:21.000+00:00 | CREATE OR REPLACE TABLE AS SELECT | {"partitionBy":"["item_in_promo","year","item"]","description":null,"isManaged":"true","properties":"{"delta.checkpoint.writeStatsAsStruct":"true","delta.checkpoint.writeStatsAsJson":"false"}","statsOnLoad":"false"} |
| 217 | 2024-05-06T23:39:44.000+00:00 | WRITE | {"mode":"Append","statsOnLoad":"false","partitionBy":"[]"} |

We are not familiar enough with the codebase and Rust to determine whether this is by design or a bug.
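One way to narrow this down without reading the Rust code is to look at the transaction log itself and check whether the commit for version 218 contains add actions at all. The sketch below assumes the log is reachable as local files (for cloud storage the commit file would need to be downloaded first) and relies on the Delta convention that commit files are newline-delimited JSON named with a 20-digit zero-padded version. The helper names are hypothetical, and Parquet checkpoint files are not read.

```python
import json
from collections import Counter


def commit_file_name(version):
    """File name of a commit inside _delta_log, e.g. version 218."""
    return f"{version:020d}.json"


def count_actions(commit_path):
    """Count actions per type (add, remove, metaData, ...) in one commit file."""
    counts = Counter()
    with open(commit_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            action = json.loads(line)
            # Each line is an object keyed by its action type.
            counts.update(action.keys())
    return counts
```

If `count_actions` on the version 218 commit reports add actions while `get_add_actions` returns nothing for that version, the data is in the log and the difference lies in how the newer releases read it.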

How to reproduce it:

We were unable to. We tried the following on a small example:

CREATE OR REPLACE TABLE tmp.delta_table
USING DELTA
PARTITIONED BY (partition_col)
TBLPROPERTIES (
  delta.checkpoint.writeStatsAsJson=false,
  delta.checkpoint.writeStatsAsStruct=true,
  delta.minReaderVersion=1,
  delta.minWriterVersion=2
  )
AS
SELECT 1 AS partition_col, 'value' AS something;

INSERT INTO tmp.delta_table
VALUES
(1, 'value1'),
(2, 'value2'),
(3, 'value3');

INSERT INTO tmp.delta_table
VALUES
(4, 'value4'),
(5, 'value5'),
(6, 'value6');

Reading is just fine, then we do the potentially troublesome operation:

CREATE OR REPLACE TABLE tmp.delta_table 
USING DELTA
PARTITIONED BY (partition_col)
TBLPROPERTIES (delta.checkpoint.writeStatsAsJson=false,delta.checkpoint.writeStatsAsStruct=true)
AS SELECT 42 AS partition_col, 'value42' AS something 

But afterwards, we are still able to read the table using v0.17.3.
