
Inconsistent schema handling for int64 columns in Delta Table updated with pandas object type #3034

@t1g0rz

Description


Environment

Delta-rs version: 0.20.2 (also checked on 0.22.0)

Binding: python


Bug

What happened:
If an int64 column (I haven’t checked other types) is defined in the Delta table schema, and the table is updated from a pandas DataFrame in which that column has object dtype, the underlying Parquet file stores the data as string. However, the Delta schema still reports int64, and data read back through Delta is also returned as int64.
So the table metadata and the physical Parquet file disagree. Please see the MRE below.

What you expected to happen:
I expect it to:

  • throw an exception, as it does when the types are completely incompatible (e.g., bool and string): DeltaError: Generic DeltaTable error: type_coercion; or
  • cast datatypes to those specified in the schema

How to reproduce it:

from deltalake import DeltaTable
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd


DeltaTable.create(
    "tmp",
    schema=pa.schema([pa.field("something", pa.int64())]),
)

dt = DeltaTable("tmp")
dt.merge(
    pd.DataFrame({"something": map(str, range(10))}),
    predicate="s.something = t.something",
    source_alias="s",
    target_alias="t",
).when_not_matched_insert_all().execute()

dt = DeltaTable("tmp")
print("pd:", dt.to_pandas().dtypes, "\n")
print("delta:", dt.schema(), "\n")
print("pa dataset:", dt.to_pyarrow_dataset().schema, "\n")
print("-----")
print("parquet:", pq.read_table("tmp/").schema, "\n")

Output:

pd: something    int64

delta: Schema([Field(something, PrimitiveType("long"), nullable=True)]) 

pa dataset: something: int64 

-----
parquet: something: string 

More details:
Scanning such a Delta table with polars>=1.13.0 raises a SchemaError:

import polars as pl

pl.scan_delta("tmp").collect()  # SchemaError: dtypes differ for column something: Utf8View != Int64

Labels: bug, good first issue, help wanted
