Schema merge failed since switch to Datafusion if a field is a list of structs #3339

Open
@liamphmurphy

Description

Environment

Delta-rs version: v0.25.4 (see below for specifics)

Binding: Python, Rust engine

Environment:
Local, S3


Bug

What happened:
Since the adoption of DataFusion, schema merges appear to fail if the existing table schema contains a list of structs (a PyArrow list type, to be exact).

What you expected to happen:

Adding a non-list field to a schema that already contains a list-of-structs field should merge successfully, as it did previously.

How to reproduce it:

On v0.25.4, run the following Python code:

import pyarrow as pa
from deltalake import write_deltalake

# Define the path for the Delta table
delta_table_path = "./datafusion-repro-test-table"

# Define the data for the first write
data_first_write = [
    {
        "uid": "ws_2",
        "event": {
            "properties": {
                "fields": [
                    {
                        "messageId": "veniam sed et elit adipisicing"
                    }
                ],
            },
        }
    }
]

schema = pa.schema([
    pa.field("uid", pa.string()),
    pa.field("event", pa.struct([
        pa.field("properties", pa.struct([
            pa.field("fields", pa.list_(pa.struct([
                pa.field("messageId", pa.string()),
            ]))),
        ])),
    ])),
])

print(schema)



first_write = pa.Table.from_pylist(data_first_write, schema=schema)

# Write data to Delta table for the first write
write_deltalake(delta_table_path, first_write, mode="append", engine="rust", schema_mode="merge")

#### NOW FOR THE SECOND WRITE THAT BREAKS ####

data_second_write = [
    {
        "uid": "ws_2",
        "event": {
            "properties": {
                "someNewField": "test-value", # New field
                "fields": [
                    {
                        "messageId": "veniam sed et elit adipisicing"
                    }
                ],
            },
        }
    }
]

second_schema = pa.schema([
    pa.field("uid", pa.string()),
    pa.field("event", pa.struct([
        pa.field("properties", pa.struct([
            pa.field("someNewField", pa.string()), # New field
            pa.field("fields", pa.list_(pa.struct([
                pa.field("messageId", pa.string()),
            ]))),
        ])),
    ])),
])

second_write = pa.Table.from_pylist(data_second_write, schema=second_schema)

# Write data to Delta table for the second write
write_deltalake(delta_table_path, second_write, mode="append", engine="rust", schema_mode="merge")

More details:

The above code works as expected on the last version I was using, v0.19.2.
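
As a sanity check that the two writes differ only by the new field, the nested records from the reproduction can be compared path by path. This is a minimal standard-library sketch; `field_paths` is a hypothetical helper written for this illustration and is not part of deltalake or pyarrow.

```python
def field_paths(value, prefix=""):
    """Return the set of dotted field paths found in a nested record."""
    paths = set()
    if isinstance(value, dict):
        for key, child in value.items():
            path = f"{prefix}.{key}" if prefix else key
            paths.add(path)
            paths |= field_paths(child, path)
    elif isinstance(value, list):
        # Every list element contributes paths under "<parent>[]".
        for item in value:
            paths |= field_paths(item, f"{prefix}[]")
    return paths


# Same records as in the reproduction above.
first_record = {
    "uid": "ws_2",
    "event": {
        "properties": {
            "fields": [{"messageId": "veniam sed et elit adipisicing"}],
        },
    },
}
second_record = {
    "uid": "ws_2",
    "event": {
        "properties": {
            "someNewField": "test-value",  # New field
            "fields": [{"messageId": "veniam sed et elit adipisicing"}],
        },
    },
}

added = field_paths(second_record) - field_paths(first_record)
removed = field_paths(first_record) - field_paths(second_record)
print("added:", sorted(added))
print("removed:", sorted(removed))
```

If the only added path is the new string field and nothing is removed, the list-of-structs field itself is unchanged between the two writes, so the merge only has to add one nullable string column inside `event.properties`.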

Labels: bug, on-hold
