Description
What happened?
When creating a Delta table with Decimal columns that have a scale of 0, the creation fails.
The failure occurs both when creating a new table (write_deltalake) and when converting existing Parquet files to a table (convert_to_deltalake).
Example code with write_deltalake
import pyarrow as pa
from deltalake import write_deltalake
from decimal import Decimal
# Define the schema with a decimal column
# precision=4, scale=0 means 4 digits total with 0 decimal places
schema = pa.schema([pa.field("decvalue", pa.decimal128(precision=4, scale=0))])
# Create a record with decimal value 1234
data = [Decimal("1234")]
# Create PyArrow table
table = pa.table([data], schema=schema)
# Write the Delta table
write_deltalake(
    table_or_uri="./decimal_table",
    data=table,
    mode="overwrite",
)
The resulting error message is:
Traceback (most recent call last):
File "/home/martin/parquet-to-delta/create-table.py", line 16, in <module>
write_deltalake(
File "/home/martin/parquet-to-delta/.venv/lib/python3.12/site-packages/deltalake/writer/writer.py", line 147, in write_deltalake
write_deltalake_rust(
_internal.DeltaError: Kernel error: Parser error: parse decimal overflow (1234.0)
Expected behavior
Table creation must also succeed when columns have type Decimal with scale=0.
The root cause of the error is the handling of the table statistics. The stats are written to JSON with serde (e.g. serde_json::to_string(&stats)).
By default, serde serializes the decimal stats, which are held as floating-point values, to JSON with a trailing “.0”, even for values that have no fractional part. In the example above, the value 1234 ends up as “1234.0” in the stats.
Unfortunately, the stats are read back via Arrow from within delta-kernel (parse_json_impl(json_strings: &StringArray, schema: ArrowSchemaRef)). Reading the same value with Arrow's strict checking fails, because Arrow's parse_decimal does not consider “1234.0” valid for precision=4 and scale=0.
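The mismatch can be reproduced in isolation. A minimal sketch (assuming the serde_json, arrow-array, and arrow-cast crates at versions matching what delta-kernel pins; this is not code from delta-rs itself):

use arrow_array::types::Decimal128Type;
use arrow_cast::parse::parse_decimal;

fn main() {
    // serde_json renders the f64 stat with a trailing ".0" ...
    let json = serde_json::to_string(&1234.0_f64).unwrap();
    assert_eq!(json, "1234.0");

    // ... which Arrow's strict parser rejects for precision=4, scale=0,
    // mirroring the "parse decimal overflow (1234.0)" error above
    assert!(parse_decimal::<Decimal128Type>(&json, 4, 0).is_err());

    // the same digits without the fractional part parse fine
    assert!(parse_decimal::<Decimal128Type>("1234", 4, 0).is_ok());
}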
Note that PySpark writes stats containing “1234” (without a trailing ".0") in such cases; those values can be parsed successfully by the approach in delta-kernel.
Possible approach for a fix:
As in other Delta implementations, the values for decimals with a scale of 0 could be serialized to JSON as integer values (without the trailing “.0”).
In apply_min_max_for_column (delta-rs/crates/core/src/writer/stats.rs, lines 554 to 562 in 52ada46) the current code is:

if let Some(min) = statistics.min {
    let min = ColumnValueStat::Value(min.into());
    min_values.insert(key.clone(), min);
}
if let Some(max) = statistics.max {
    let max = ColumnValueStat::Value(max.into());
    max_values.insert(key.clone(), max);
}
This could be changed to emit scale-0 decimals as integers, for example:

if let Some(min) = statistics.min {
    // For compatibility with Spark and parsing stats with Arrow,
    // a decimal with scale 0 is stored as an integer
    if let (Some(LogicalType::Decimal { scale: 0, .. }), StatsScalar::Decimal(f_val)) =
        (column_descr.logical_type(), &min)
    {
        let min_value = ColumnValueStat::Value((*f_val as i64).into());
        min_values.insert(key.clone(), min_value);
    } else {
        let min = ColumnValueStat::Value(min.into());
        min_values.insert(key.clone(), min);
    }
}
if let Some(max) = statistics.max {
    // For compatibility with Spark and parsing stats with Arrow,
    // a decimal with scale 0 is stored as an integer
    if let (Some(LogicalType::Decimal { scale: 0, .. }), StatsScalar::Decimal(f_val)) =
        (column_descr.logical_type(), &max)
    {
        let max_value = ColumnValueStat::Value((*f_val as i64).into());
        max_values.insert(key.clone(), max_value);
    } else {
        let max = ColumnValueStat::Value(max.into());
        max_values.insert(key.clone(), max);
    }
}
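With this change, the stats round-trip. A minimal sketch (same assumed crates as above, not delta-rs code) showing that the integer form is accepted where the float form was not:

use arrow_array::types::Decimal128Type;
use arrow_cast::parse::parse_decimal;

fn main() {
    // cast the scale-0 decimal stat to i64 before serializing, as in the proposed fix
    let stat = 1234.0_f64;
    let json = serde_json::to_string(&(stat as i64)).unwrap();
    assert_eq!(json, "1234"); // no trailing ".0", matching what PySpark writes

    // the Arrow-based stats parsing in delta-kernel now accepts the value
    assert!(parse_decimal::<Decimal128Type>(&json, 4, 0).is_ok());
}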
In stats.rs there are already code paths that deal with exactly this special case of decimals represented as integers, but in the opposite direction, i.e. when reading such values, which are integers in the JSON, into StatsScalar:
delta-rs/crates/core/src/writer/stats.rs, lines 277 to 281 in 52ada46:

(Statistics::Int32(v), Some(LogicalType::Decimal { scale, .. })) => {
    let val = get_stat!(v) as f64 / 10.0_f64.powi(*scale); // Spark serializes these as numbers
    Ok(Self::Decimal(val))
}

delta-rs/crates/core/src/writer/stats.rs, lines 304 to 308 in 52ada46:

(Statistics::Int64(v), Some(LogicalType::Decimal { scale, .. })) => {
    let val = get_stat!(v) as f64 / 10.0_f64.powi(*scale); // Spark serializes these as numbers
    Ok(Self::Decimal(val))
}
Operating System
Linux
Binding
Python
Bindings Version
No response
Steps to reproduce
See example code at the beginning.