Description
Environment
Delta-rs version:
0.25.5
Binding:
Rust, Python
Environment:
- Cloud provider: Azure
- OS: macOS, Linux
- Other:
Bug
What happened:
Our tables write checkpoints with statistics stored as structs only: delta.checkpoint.writeStatsAsStruct = true and delta.checkpoint.writeStatsAsJson = false.
After a checkpoint, add_actions_table returns no statistics:
- it only checks for the existence of stats on Adds instead of also checking stats_parsed: https://github.com/delta-io/delta-rs/blob/python-v0.25.5/crates/core/src/table/state_arrow.rs#L98 - probably because the files iterator used internally uses read_adds, which does not set stats_parsed
What you expected to happen:
I expect add_actions_table to have statistics available regardless of what the latest checkpoint is and how the stats were written to it
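The expected behaviour can be illustrated with a minimal plain-Python sketch. This is not delta-rs code: the add actions are modelled as plain dicts, and resolve_stats is a hypothetical helper name used only for illustration. It shows the fallback I would expect, where stats are taken from the JSON stats field when present and from stats_parsed otherwise.

```python
import json
from typing import Any, Optional


def resolve_stats(add: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return file statistics from an add action, whichever form is present.

    Hypothetical helper: `add` is a plain dict standing in for an add action.
    """
    raw = add.get("stats")
    if raw is not None:
        # Stats written as a JSON string (writeStatsAsJson = true).
        return json.loads(raw)
    parsed = add.get("stats_parsed")
    if parsed is not None:
        # Stats written as a struct (writeStatsAsStruct = true).
        return dict(parsed)
    return None


# The same statistics, once as JSON and once as a struct.
json_add = {
    "path": "a.parquet",
    "stats": '{"numRecords": 3, "minValues": {"id": 1}, "maxValues": {"id": 3}}',
}
struct_add = {
    "path": "b.parquet",
    "stats": None,
    "stats_parsed": {"numRecords": 3, "minValues": {"id": 1}, "maxValues": {"id": 3}},
}
no_stats_add = {"path": "c.parquet"}
```

With this fallback, json_add and struct_add resolve to the same statistics, and only no_stats_add resolves to None.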
How to reproduce it:
- Configure table with delta.checkpoint.writeStatsAsStruct = true and delta.checkpoint.writeStatsAsJson = false
- Write data
- Checkpoint
- Call add_actions_table
- Observe that no stats are present
More details:
The log_data method is probably usable here for add_actions_table, since it already has the data in Arrow format and it hydrates stats regardless of how they are represented in checkpoints.
It would just need a method on FileStatsAccessor to build a record batch out of its internal columns.
As a workaround, I can probably enable JSON stats in addition to struct stats in checkpoints, at the cost of a little overhead.
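For reference, the workaround amounts to the following table properties, written here as a plain Python dict. The name CHECKPOINT_STATS_PROPERTIES is only illustrative. I have not verified that this restores stats in add_actions_table, so treat it as a sketch of the intended configuration.

```python
# Workaround sketch: keep struct stats and additionally write JSON stats
# to checkpoints. Delta table property values are strings.
CHECKPOINT_STATS_PROPERTIES = {
    "delta.checkpoint.writeStatsAsStruct": "true",
    "delta.checkpoint.writeStatsAsJson": "true",
}
```

In the Python binding, a dict like this could presumably be passed as the configuration argument of write_deltalake when the table is created.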
Our use case: we make add_actions_table queryable with DataFusion to provide a SQL function for exploring Delta table stats.