
[Bug]: Cannot create tables with Decimal, scale=0 #3893

@MartinKolbAtWork


What happened?

When creating a delta table with Decimal columns that have a scale value of 0, the creation fails.
The failure occurs both when creating a new table (write_deltalake) and when converting existing Parquet files to a table (convert_to_deltalake).

Example code with write_deltalake

import pyarrow as pa
from deltalake import write_deltalake
from decimal import Decimal

# Define the schema with a decimal column
# precision=4, scale=0 means 4 digits total with 0 decimal places
schema = pa.schema([pa.field("decvalue", pa.decimal128(precision=4, scale=0))])

# Create a record with decimal value 1234
data = [Decimal("1234")]

# Create PyArrow table
table = pa.table([data], schema=schema)

# Write the Delta table
write_deltalake(
    table_or_uri="./decimal_table",
    data=table,
    mode="overwrite"
)

The resulting error message is:

Traceback (most recent call last):
  File "/home/martin/parquet-to-delta/create-table.py", line 16, in <module>
    write_deltalake(
  File "/home/martin/parquet-to-delta/.venv/lib/python3.12/site-packages/deltalake/writer/writer.py", line 147, in write_deltalake
    write_deltalake_rust(
_internal.DeltaError: Kernel error: Parser error: parse decimal overflow (1234.0)

Expected behavior

The creation of tables must also succeed when columns have type Decimal with scale=0.

The root cause of the error is the handling of the delta table stats. The stats are written to JSON with serde (e.g. serde_json::to_string(&stats)).
By default, serde serializes decimal values (carried as floating point) to JSON with a trailing ".0", even for values that have no fractional part. In the example above, the value 1234 ends up as "1234.0" in the stats.
Unfortunately, the stats are read back via Arrow from within the delta kernel (parse_json_impl(json_strings: &StringArray, schema: ArrowSchemaRef)). Reading the same value with Arrow's strict checking fails, because Arrow's "parse_decimal" does not consider "1234.0" valid for "precision=4" and "scale=0".

Note that PySpark creates stats containing "1234" (without a trailing ".0") in such cases. These values can be parsed successfully with the approach used in the delta kernel.
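
As an illustration, here is a minimal standalone sketch (using serde_json directly, not delta-rs code) of why the trailing ".0" appears: serializing the stat as a floating point number yields "1234.0", while serializing it as an integer yields the Spark-compatible "1234":

// Cargo.toml: serde_json = "1"
fn main() {
    // Serializing the scale-0 decimal stat as f64 (the behavior described
    // above) produces a trailing ".0" ...
    let as_float: f64 = 1234.0;
    assert_eq!(serde_json::to_string(&as_float).unwrap(), "1234.0");

    // ... whereas serializing the same value as an integer produces "1234",
    // which matches what PySpark writes and what Arrow's strict decimal
    // parsing accepts for precision=4, scale=0.
    let as_int: i64 = 1234;
    assert_eq!(serde_json::to_string(&as_int).unwrap(), "1234");
}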


Possible approach for a fix:
As in other delta implementations, the values for decimals with a scale of 0 could be serialized to JSON as Integer values (without the trailing “.0”).
In apply_min_max_for_column, the current code is:

    if let Some(min) = statistics.min {
        let min = ColumnValueStat::Value(min.into());
        min_values.insert(key.clone(), min);
    }
    if let Some(max) = statistics.max {
        let max = ColumnValueStat::Value(max.into());
        max_values.insert(key.clone(), max);
    }

The values for min and max could be augmented with a small "if" block that handles this case accordingly:

            if let Some(min) = statistics.min {
                // For compatibility with Spark and parsing stats with Arrow, if the decimal has scale 0, it is stored as integer
                if let (Some(LogicalType::Decimal { scale: 0, .. }), StatsScalar::Decimal(f_val)) = (column_descr.logical_type(), &min) {
                    let min_value = ColumnValueStat::Value((*f_val as i64).into());
                    min_values.insert(key.clone(), min_value);
                } else {
                    let min = ColumnValueStat::Value(min.into());
                    min_values.insert(key.clone(), min);
                }
            }

            if let Some(max) = statistics.max {
                // For compatibility with Spark and parsing stats with Arrow, if the decimal has scale 0, it is stored as integer
                if let (Some(LogicalType::Decimal { scale: 0, .. }), StatsScalar::Decimal(f_val)) = (column_descr.logical_type(), &max) {
                    let max_value = ColumnValueStat::Value((*f_val as i64).into());
                    max_values.insert(key.clone(), max_value);
                } else {
                    let max = ColumnValueStat::Value(max.into());
                    max_values.insert(key.clone(), max);
                }
            }

In stats.rs there are already code paths which handle exactly this special case of decimals provided as integers. They work in the opposite direction, i.e. they read values that are integers in the JSON into a "StatsScalar" as decimals:

    (Statistics::Int32(v), Some(LogicalType::Decimal { scale, .. })) => {
        let val = get_stat!(v) as f64 / 10.0_f64.powi(*scale);
        // Spark serializes these as numbers
        Ok(Self::Decimal(val))
    }
    (Statistics::Int64(v), Some(LogicalType::Decimal { scale, .. })) => {
        let val = get_stat!(v) as f64 / 10.0_f64.powi(*scale);
        // Spark serializes these as numbers
        Ok(Self::Decimal(val))
    }
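
For completeness, a minimal standalone sketch (not actual delta-rs code) of how an integer-encoded, scale-0 decimal stat written by the proposed fix would round-trip through that existing read-side scaling:

fn main() {
    // Write side (proposed fix): the scale-0 decimal min/max 1234 is
    // serialized as the integer 1234.
    let written: i64 = 1234;

    // Read side (existing stats.rs match arms shown above): the integer is
    // divided by 10^scale to recover the f64 held in StatsScalar::Decimal.
    let scale: i32 = 0;
    let read_back = written as f64 / 10.0_f64.powi(scale);
    assert_eq!(read_back, 1234.0);
}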

Operating System

Linux

Binding

Python

Bindings Version

No response

Steps to reproduce

See example code at the beginning.

Relevant logs
