TIMESTAMP_NTZ datatype is currently not supported in the converter #745

@vucenovic

https://docs.databricks.com/aws/en/sql/language-manual/data-types/timestamp-ntz-type

TIMESTAMP_NTZ is a data type that can be used for partitioning, so according to the converter function's docstring it should be supported, but the elif branch for it is missing. To be precise, I think the current timestamp mapping is really the timestamp_ntz one, because the pd.Timestamp it constructs isn't passed any timezone information.

# converter.py
def to_converter(schema_type) -> Callable[[str], Any]:
    """
    For types that support partitioning, a lambda to parse data into the
    corresponding type is returned. For data types that cannot be partitioned
    on, we return None. The caller is expected to check if the value is None before using.
    :param schema_type: str or json representing a data type
    :return: converter function or None
    """
    if schema_type == "boolean":
        return lambda x: None if (x is None or x == "") else (x is True or x == "true")
    elif schema_type == "byte":
        return lambda x: np.nan if (x is None or x == "") else np.int8(x)
    elif schema_type == "short":
        return lambda x: np.nan if (x is None or x == "") else np.int16(x)
    elif schema_type == "integer":
        return lambda x: np.nan if (x is None or x == "") else np.int32(x)
    elif schema_type == "long":
        return lambda x: np.nan if (x is None or x == "") else np.int64(x)
    elif schema_type == "float":
        return lambda x: np.nan if (x is None or x == "") else np.float32(x)
    elif schema_type == "double":
        return lambda x: np.nan if (x is None or x == "") else np.float64(x)
    elif isinstance(schema_type, str) and schema_type.startswith("decimal"):
        return lambda x: None if (x is None or x == "") else Decimal(x)
    elif schema_type == "string":
        return lambda x: None if (x is None or x == "") else str(x)
    elif schema_type == "date":
        return lambda x: None if (x is None or x == "") else pd.Timestamp(x).date()
    elif schema_type == "timestamp":
        return lambda x: pd.NaT if (x is None or x == "") else pd.Timestamp(x)
    elif schema_type == "binary":
        return None  # partition on binary column not supported
    elif isinstance(schema_type, dict) and schema_type["type"] in ("array", "struct", "map"):
        return None  # partition on complex column not supported

    raise ValueError(f"Could not parse datatype: {schema_type}")

How to reproduce: try reading a table that has a TIMESTAMP_NTZ column through the following interface:

df = delta_sharing.load_as_pandas(table_url, convert_in_batches=True, use_delta_format=False)

Adding another elif branch for timestamp_ntz solves the issue. I can create a PR if you like.
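
For illustration, a minimal sketch of what such a branch could look like. It assumes the schema type string is "timestamp_ntz" and that the value can be parsed exactly like the existing timestamp branch. The function name to_converter_sketch and the sample value are hypothetical. This is a trimmed-down stand-in for to_converter, not the library's actual code.

```python
import pandas as pd


def to_converter_sketch(schema_type):
    """Hypothetical stand-in for to_converter, limited to the timestamp types."""
    if schema_type == "timestamp":
        # Same lambda as the existing branch in converter.py.
        return lambda x: pd.NaT if (x is None or x == "") else pd.Timestamp(x)
    elif schema_type == "timestamp_ntz":
        # Proposed branch (assumption): same parsing, no timezone attached.
        return lambda x: pd.NaT if (x is None or x == "") else pd.Timestamp(x)

    raise ValueError(f"Could not parse datatype: {schema_type}")


convert = to_converter_sketch("timestamp_ntz")
parsed = convert("2024-01-02 03:04:05")  # sample partition value, assumed format
print(parsed.tz)  # None: the resulting timestamp is timezone-naive
```

Empty or missing partition values map to pd.NaT, mirroring the existing timestamp branch.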
