deltalake mixing struct columns based on order #3750

@dabljues

Environment

Delta-rs version: not sure?

Binding: 1.1.4


Bug

When writing a DataFrame with struct columns to deltalake, the data gets corrupted if the order of the fields inside a struct differs between writes. If it were just a list of strings, sure thing, but since struct fields are named, this shouldn't happen AFAIK.

Best to show with an example:

import os
import shutil

import polars as pl
from deltalake import write_deltalake

DELTA_PATH = "data/test_schema"

if os.path.exists(DELTA_PATH):
    shutil.rmtree(DELTA_PATH)

# Same struct fields in both records; only the key order differs in `b`.
a = {"user_data": [{"name": "John", "surname": "Smith"}]}
b = {"user_data": [{"surname": "Brown", "name": "Mark"}]}


def unnest_df(df: pl.DataFrame) -> pl.DataFrame:
    return df.explode("user_data").unnest("user_data")


df_a = pl.from_records([a])
print(unnest_df(df_a))
df_b = pl.from_records([b])
print(unnest_df(df_b))

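# Append both frames to the same table; the second write has the struct fields in a different order.
write_deltalake(DELTA_PATH, df_a, mode="append", schema_mode="merge")
write_deltalake(DELTA_PATH, df_b, mode="append", schema_mode="merge")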

df = pl.read_delta(DELTA_PATH)

print(df)
print(unnest_df(df))
print(df["user_data"].to_list())

So, the first DataFrame has a column of type list[struct] with name and surname fields. The second DataFrame is identical, except that the struct keys are ordered surname, name. I write both to deltalake and the result is this (I unnest the structs so it's easier to see):

shape: (1, 2)
┌──────┬─────────┐
│ name ┆ surname │
│ ---  ┆ ---     │
│ str  ┆ str     │
╞══════╪═════════╡
│ John ┆ Smith   │
└──────┴─────────┘
shape: (1, 2)
┌─────────┬──────┐
│ surname ┆ name │
│ ---     ┆ ---  │
│ str     ┆ str  │
╞═════════╪══════╡
│ Brown   ┆ Mark │
└─────────┴──────┘
shape: (2, 2)
┌───────┬─────────┐
│ name  ┆ surname │
│ ---   ┆ ---     │
│ str   ┆ str     │
╞═══════╪═════════╡
│ Brown ┆ Mark    │
│ John  ┆ Smith   │
└───────┴─────────┘
[[{'name': 'Brown', 'surname': 'Mark'}], [{'name': 'John', 'surname': 'Smith'}]]

As you can see, the name and surname of Mark Brown are swapped in the table: name holds "Brown" and surname holds "Mark".
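
The swapped row is consistent with struct fields being aligned to the table schema by position instead of by name. The pure-Python sketch below only illustrates that difference. `align_by_position` and `align_by_name` are hypothetical helpers, not deltalake or polars functions, and this is not a claim about how deltalake is implemented internally.

```python
# Illustration only: hypothetical helpers, not part of deltalake or polars.

def align_by_position(record: dict, schema_fields: list[str]) -> dict:
    """Assign values to schema fields in order, ignoring the record's own keys."""
    return dict(zip(schema_fields, record.values()))


def align_by_name(record: dict, schema_fields: list[str]) -> dict:
    """Look up each schema field by its key in the record."""
    return {field: record[field] for field in schema_fields}


table_fields = ["name", "surname"]  # field order established by the first write
second_row = {"surname": "Brown", "name": "Mark"}

print(align_by_position(second_row, table_fields))  # {'name': 'Brown', 'surname': 'Mark'}
print(align_by_name(second_row, table_fields))      # {'name': 'Mark', 'surname': 'Brown'}
```

The positional variant reproduces the mixed-up row shown in the output, while the name-based variant gives the expected one.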

From what I understand, this should not be happening. In this simple example I could obviously sort the struct fields by key name, but if I have rows with hundreds of columns, any of which could be a list[struct] (and I usually don't know which ones), that's not practical. And even if it were, I shouldn't have to do that in my opinion. This should just work.
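
For rows with many columns, one way to at least find the affected columns is to scan the raw records for nested dicts that carry the same keys in different orders. This is a minimal sketch, assuming the data is still available as Python dicts as in the reproduction above. `iter_key_orders` and `columns_with_mixed_key_order` are hypothetical names introduced for illustration.

```python
from collections import defaultdict


def iter_key_orders(value, path=()):
    """Yield (path, keys) for every dict nested anywhere inside value."""
    if isinstance(value, dict):
        yield path, tuple(value)
        for key, item in value.items():
            yield from iter_key_orders(item, path + (key,))
    elif isinstance(value, list):
        for item in value:
            yield from iter_key_orders(item, path + ("[]",))


def columns_with_mixed_key_order(records):
    """Return top-level columns whose nested dicts differ only in key order."""
    orders = defaultdict(set)
    for record in records:
        for column, value in record.items():
            for path, keys in iter_key_orders(value):
                orders[(column, path)].add(keys)
    mixed = set()
    for (column, _path), seen in orders.items():
        # More orderings than distinct key sets means one key set appears in several orders.
        if len(seen) > len({frozenset(keys) for keys in seen}):
            mixed.add(column)
    return sorted(mixed)


a = {"user_data": [{"name": "John", "surname": "Smith"}]}
b = {"user_data": [{"surname": "Brown", "name": "Mark"}]}
print(columns_with_mixed_key_order([a, b]))  # ['user_data']
```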

Is this an issue with deltalake itself? Is there a way to work around it?
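
One possible workaround, sketched under the assumption that the data starts out as Python records and that field order is the only difference between writes: normalize the key order recursively before building each DataFrame, so that no per-column knowledge is needed. `sort_struct_keys` is a hypothetical helper, and this has not been verified against deltalake itself.

```python
def sort_struct_keys(value):
    """Recursively rebuild dicts with sorted keys; other values pass through unchanged."""
    if isinstance(value, dict):
        return {key: sort_struct_keys(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [sort_struct_keys(item) for item in value]
    return value


a = {"user_data": [{"name": "John", "surname": "Smith"}]}
b = {"user_data": [{"surname": "Brown", "name": "Mark"}]}

normalized = [sort_struct_keys(record) for record in (a, b)]
print([list(record["user_data"][0]) for record in normalized])
# [['name', 'surname'], ['name', 'surname']]
```

Each normalized record could then be passed to `pl.from_records([record])` as in the reproduction, so that both writes present the struct fields in the same order. Note that this also sorts the top-level column keys, and it only sidesteps the problem rather than fixing name-based matching.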

Labels: bug
