Description
Environment
- Delta-rs version: 0.22 (tested multiple versions up to latest)
- PyArrow version: 18.0.0
- OS: Mac 14.4 (23E214)
- Delta table on S3
Bug
What happened:
When I apply a filter expression (lighting == "day") through pyarrow.dataset, no rows are returned. However, if I skip the filter at this stage and instead filter the resulting pandas DataFrame (results[results["lighting"] == "day"]), the filtered DataFrame is non-empty, which confirms that rows matching the condition exist in the dataset.
What you expected to happen:
The filter method should correctly return rows where lighting == "day" when applied directly on the pyarrow.dataset.
How to reproduce it:
Given a Delta table defined as follows:
CREATE TABLE hive_metastore.dwh.table_name (
  key STRING,
  ...
  lighting STRING,
  ...
  h3_id_res9 BIGINT)
USING delta
PARTITIONED BY (h3_id_res9)
LOCATION 'dbfs:s3_path'
TBLPROPERTIES (
  'delta.minReaderVersion' = '1',
  'delta.minWriterVersion' = '2')

# Python code
import pyarrow.compute as pc
import pyarrow.dataset as ds

# get_delta_table is a helper (not shown) that returns the DeltaTable
delta_table = get_delta_table(table_path, dynamo_table_name)
partitions = [("h3_id_res9", "in", str(608716487191953407))]
condition = pc.equal(ds.field("lighting"), "day")

# Apply filter directly on pyarrow dataset
results = (
    delta_table.to_pyarrow_dataset(partitions=partitions)
    .filter(expression=condition)
    .to_table()
    .to_pandas()
)
# Bug: results are empty even though matching rows exist
assert results.empty, "Filter on the pyarrow dataset unexpectedly returned rows."

# Remove filter and filter using pandas
results = (
    delta_table.to_pyarrow_dataset(partitions=partitions)
    .to_table()
    .to_pandas()
)
results_filtered = results[results["lighting"] == "day"].reset_index(drop=True)
# Results are non-empty and rows were filtered as expected
assert not results_filtered.empty, "Expected non-empty results, but got none after pandas filtering."

More Details:
- In other tables, I am able to filter the data, so I don't think the issue is tied to the data type.