-
Couldn't load subscription status.
- Fork 537
Description
Environment
Delta-rs version: 0.22.3
Binding: python
Environment:
- Cloud provider: None
- OS: macOS
- Other: polars 1.17.1, pyarrow 18.1.0
Bug
What happened:
Upgrading to deltalake 0.22 causes issues with nested lists. Previously structs of three columns a, b, c in pyarrow schema looks like
struct_with_list: struct<a: int32, b: float, c: list<item: int32>> not null
child 0, a: int32
child 1, b: float
child 2, c: list<item: int32>
child 0, item: int32
but now with 0.22 it is
struct_with_list: struct<a: int32, b: float, c: list<element: int32>> not null
child 0, a: int32
child 1, b: float
child 2, c: list<element: int32>
child 0, element: int32
They look similar despite "item" and "element". The problem is that when pl.scan_pyarrow_dataset(deltatable.to_pyarrow_dataset()) was called, it raised an error
polars.exceptions.ComputeError: ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<c0: int32, c1: float, c2: list<item: int32>> output fields: struct<a: int32, b: float, c: list<element: int32>>
It looks like the column names are missing and internally the column names are c0, c1 and c2. When I rename the column in schema to c0, c1 and c2, the error disappears.
What you expected to happen:
Allow arbitrary column names like before 0.22
How to reproduce it:
import shutil
import polars as pl
import pyarrow as pa
from deltalake import DeltaTable
def main():
shutil.rmtree("tmp")
dt = DeltaTable.create(
"tmp",
schema=pa.schema(
[
("a", pa.int32()),
("b", pa.string()),
("c", pa.struct({"d": pa.int16(), "e": pa.int16()})),
]
),
)
df = pl.DataFrame(
{
"a": [0, 1],
"b": ["x", "y"],
"c": [
{"d": -55, "e": -32},
{"d": 0, "e": 0},
],
}
)
dt.merge(
source=df.to_arrow(),
predicate=" AND ".join([f"target.{x} = source.{x}" for x in ["a"]]),
source_alias="source",
target_alias="target",
large_dtypes=False,
).when_not_matched_insert_all().execute()
arrow_dt = dt.to_pyarrow_dataset()
new_df = pl.scan_pyarrow_dataset(arrow_dt)
new_df.collect()raises error
ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<c0: int64, c1: int64> output fields: struct<d: int16, e: int16>
More details: