to_pyarrow_dataset raises error for nested struct of list #3063

@jensqin

Description

Environment

Delta-rs version: 0.22.3

Binding: python

Environment:

  • Cloud provider: None
  • OS: macOS
  • Other: polars 1.17.1, pyarrow 18.1.0

Bug

What happened:
Upgrading to deltalake 0.22 causes issues with nested lists. Previously structs of three columns a, b, c in pyarrow schema looks like

struct_with_list: struct<a: int32, b: float, c: list<item: int32>> not null
  child 0, a: int32
  child 1, b: float
  child 2, c: list<item: int32>
      child 0, item: int32

but with 0.22 it is:

struct_with_list: struct<a: int32, b: float, c: list<element: int32>> not null
  child 0, a: int32
  child 1, b: float
  child 2, c: list<element: int32>
      child 0, element: int32

The two schemas differ only in the name of the inner list field ("item" vs "element"). The problem is that calling pl.scan_pyarrow_dataset(deltatable.to_pyarrow_dataset()) raised an error:

polars.exceptions.ComputeError: ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<c0: int32, c1: float, c2: list<item: int32>> output fields: struct<a: int32, b: float, c: list<element: int32>>

It looks like the struct field names are lost and internally default to c0, c1 and c2. When I rename the fields in the schema to c0, c1 and c2, the error disappears.

What you expected to happen:
Arbitrary field names should be allowed, as they were before 0.22.

How to reproduce it:

import shutil

import polars as pl
import pyarrow as pa
from deltalake import DeltaTable

def main():
    shutil.rmtree("tmp", ignore_errors=True)  # tolerate a missing tmp dir on first run
    dt = DeltaTable.create(
        "tmp",
        schema=pa.schema(
            [
                ("a", pa.int32()),
                ("b", pa.string()),
                ("c", pa.struct({"d": pa.int16(), "e": pa.int16()})),
            ]
        ),
    )

    df = pl.DataFrame(
        {
            "a": [0, 1],
            "b": ["x", "y"],
            "c": [
                {"d": -55, "e": -32},
                {"d": 0, "e": 0},
            ],
        }
    )

    dt.merge(
        source=df.to_arrow(),
        predicate=" AND ".join([f"target.{x} = source.{x}" for x in ["a"]]),
        source_alias="source",
        target_alias="target",
        large_dtypes=False,
    ).when_not_matched_insert_all().execute()

    arrow_dt = dt.to_pyarrow_dataset()
    new_df = pl.scan_pyarrow_dataset(arrow_dt)
    new_df.collect()

if __name__ == "__main__":
    main()

raises the error

ArrowTypeError: struct fields don't match or are in the wrong order: Input fields: struct<c0: int64, c1: int64> output fields: struct<d: int16, e: int16>

Metadata

Labels: binding/python (Issues for the Python package), bug (Something isn't working)
