Environment
Delta-rs version: 0.20.2 (also checked on 0.22.0)
Binding: python
Bug
What happened:
If a column is declared as int64 in the Delta table schema (I haven't checked other types) and the table is updated from a Pandas DataFrame in which that column has object dtype, the underlying Parquet file stores the data as string. However, querying the table's schema still reports int64, and data read back through delta-rs also comes out as int64.
This is inconsistent. Please see the MRE below.
What you expected to happen:
I expect it to:
- throw an exception, as it does when the types are completely incompatible (e.g., bool and string): DeltaError: Generic DeltaTable error: type_coercion (see the sketch after this list); or
- cast the data to the types specified in the table schema
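For comparison, here is a minimal sketch of the fully incompatible case mentioned above (same create/merge pattern as the MRE below; the table name tmp_bool is only illustrative, and the error shown is the one quoted above rather than re-verified here):

from deltalake import DeltaTable
import pyarrow as pa
import pandas as pd

# Target column is bool instead of int64.
DeltaTable.create(
    "tmp_bool",
    schema=pa.schema([pa.field("something", pa.bool_())]),
)
dt_bool = DeltaTable("tmp_bool")

# Merging an object/string source column into a bool target raises:
# DeltaError: Generic DeltaTable error: type_coercion
dt_bool.merge(
    pd.DataFrame({"something": map(str, range(10))}),
    predicate="s.something = t.something",
    source_alias="s",
    target_alias="t",
).when_not_matched_insert_all().execute()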
How to reproduce it:
from deltalake import DeltaTable
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
# Create a Delta table whose only column is declared as int64.
DeltaTable.create(
    "tmp",
    schema=pa.schema([pa.field("something", pa.int64())]),
)

dt = DeltaTable("tmp")

# The source column has object dtype (Python strings), not int64.
dt.merge(
    pd.DataFrame({"something": map(str, range(10))}),
    predicate="s.something = t.something",
    source_alias="s",
    target_alias="t",
).when_not_matched_insert_all().execute()

dt = DeltaTable("tmp")
print("pd:", dt.to_pandas().dtypes, "\n")
print("delta:", dt.schema(), "\n")
print("pa dataset:", dt.to_pyarrow_dataset().schema, "\n")
print("-----")
print("parquet:", pq.read_table("tmp/").schema, "\n")

Output:
pd: something int64
delta: Schema([Field(something, PrimitiveType("long"), nullable=True)])
pa dataset: something: int64
-----
parquet: something: string
More details:
If one tries to scan such a Delta table with polars>=1.13.0, they will get a SchemaError:
import polars as pl
pl.scan_delta("tmp").collect()  # SchemaError: dtypes differ for column something: Utf8View != Int64
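A possible caller-side workaround (just a sketch, assuming pyarrow's string-to-int64 cast is acceptable for the data): cast the source to the table's Arrow schema before calling merge, so the source column is already int64 when it reaches delta-rs:

import pandas as pd
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("tmp")
df = pd.DataFrame({"something": map(str, range(10))})

# Cast the object/string column to int64 up front so the Parquet file
# written by merge() matches the declared table schema.
source = pa.Table.from_pandas(df).cast(
    pa.schema([pa.field("something", pa.int64())])
)

dt.merge(
    source,
    predicate="s.something = t.something",
    source_alias="s",
    target_alias="t",
).when_not_matched_insert_all().execute()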