Description
Environment
Delta-rs version: not sure?
Binding: 1.1.4
Bug
When writing a DataFrame that contains structs to a Delta table with deltalake, the data gets corrupted if the order of the fields within a struct changes between writes. If the values were a plain list of strings, positional matching would be understandable, but struct fields have names, so as far as I know this shouldn't happen.
Best to show with an example:
import os
import shutil
import polars as pl
from deltalake import write_deltalake
DELTA_PATH = "data/test_schema"
if os.path.exists(DELTA_PATH):
    shutil.rmtree(DELTA_PATH)
a = {"user_data": [{"name": "John", "surname": "Smith"}]}
b = {"user_data": [{"surname": "Brown", "name": "Mark"}]}
def unnest_df(df: pl.DataFrame) -> pl.DataFrame:
    return df.explode("user_data").unnest("user_data")
df_a = pl.from_records([a])
print(unnest_df(df_a))
df_b = pl.from_records([b])
print(unnest_df(df_b))
write_deltalake(DELTA_PATH, df_a, mode="append", schema_mode="merge")
write_deltalake(DELTA_PATH, df_b, mode="append", schema_mode="merge")
df = pl.read_delta(DELTA_PATH)
print(df)
print(unnest_df(df))
print(df["user_data"].to_list())
So, the first DataFrame has a column of type list[struct] with the fields name and surname. The second DataFrame is identical, except that the struct keys come in the opposite order: surname, name. I write both to the Delta table, and this is the result (unnested so it's easier to read):
shape: (1, 2)
┌──────┬─────────┐
│ name ┆ surname │
│ ---  ┆ ---     │
│ str  ┆ str     │
╞══════╪═════════╡
│ John ┆ Smith   │
└──────┴─────────┘
shape: (1, 2)
┌─────────┬──────┐
│ surname ┆ name │
│ ---     ┆ ---  │
│ str     ┆ str  │
╞═════════╪══════╡
│ Brown   ┆ Mark │
└─────────┴──────┘
shape: (2, 2)
┌───────┬─────────┐
│ name  ┆ surname │
│ ---   ┆ ---     │
│ str   ┆ str     │
╞═══════╪═════════╡
│ Brown ┆ Mark    │
│ John  ┆ Smith   │
└───────┴─────────┘
[[{'name': 'Brown', 'surname': 'Mark'}], [{'name': 'John', 'surname': 'Smith'}]]
As you can see, for Mark, the name and surname are swapped.
From what I understand, this should not happen. In this simple example I could obviously sort the struct fields by name, but with rows of hundreds of columns, any of which could be a list[struct] (and I usually don't know which), that isn't practical. And even if it were, I shouldn't have to do that, in my opinion. This should just work.
Is it an issue with deltalake itself? Can I somehow work around that?
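For reference, the "sort the struct by key names" idea can be applied generically, without knowing which columns hold structs, as long as the records are still plain Python dicts before they reach Polars. The sketch below is only that: `sort_struct_keys` is a hypothetical helper, not part of polars or deltalake, and whether feeding the normalized records through `pl.from_records` and `write_deltalake` avoids the swap is an assumption I have not verified.

```python
from typing import Any


def sort_struct_keys(value: Any) -> Any:
    """Hypothetical helper: rebuild every nested dict with its keys in sorted order.

    Lists are walked element by element; all other values are returned unchanged.
    """
    if isinstance(value, dict):
        return {key: sort_struct_keys(value[key]) for key in sorted(value)}
    if isinstance(value, list):
        return [sort_struct_keys(item) for item in value]
    return value


a = {"user_data": [{"name": "John", "surname": "Smith"}]}
b = {"user_data": [{"surname": "Brown", "name": "Mark"}]}

# After normalization both records list the struct fields in the same order.
normalized_a = sort_struct_keys(a)
normalized_b = sort_struct_keys(b)
```

The normalized dicts would then replace `a` and `b` in the `pl.from_records([...])` calls. This only helps when the data starts out as Python objects; it does nothing for DataFrames that already arrive with differing struct field orders.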