[SPARK-51587][PYTHON][SS] Fix an issue where timestamp cannot be used in ListState when multiple state data is involved #50349
Conversation
```diff
@@ -1424,6 +1424,12 @@ def _to_numpy_type(type: DataType) -> Optional["np.dtype"]:
         return np.dtype("float32")
     elif type == DoubleType():
         return np.dtype("float64")
+    elif type == TimestampType():
```
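For context, here is a minimal sketch of what the completed function might look like after this hunk. The `datetime64[us]` return value is an assumption based on the microsecond-precision discussion below, not the verbatim patch:

```python
from typing import Optional

import numpy as np
from pyspark.sql.types import DataType, DoubleType, FloatType, TimestampType


def _to_numpy_type(type: DataType) -> Optional["np.dtype"]:
    # Sketch: only the branches visible in the diff are shown here.
    if type == FloatType():
        return np.dtype("float32")
    elif type == DoubleType():
        return np.dtype("float64")
    elif type == TimestampType():
        # Assumption: Spark only accepts microsecond precision when converting
        # from Arrow, so map timestamps to datetime64[us] rather than pandas'
        # default datetime64[ns].
        return np.dtype("datetime64[us]")
    return None
```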
The change is inside the base pandas types file; does this mean all pandas-related operators (batch/streaming) are affected?
I am confused by the PR title: if this change is inside the base pyspark types, why does it only impact the timestamp type when "multiple state data" is involved? Why does it work fine when only a single piece of state data is involved?
This utility function is only used by TWS, but we followed the previous suggestion to move this util into the base types file since it does basic type conversion. This issue only happens when we use Arrow to transmit state data to the JVM side, so only multiple state data is affected.
I added more details to the PR's description.
Thanks! I understand now, since `list_state_client` is the only variable using `_send_arrow_state`, and `_send_arrow_state` is the only function using `_to_numpy_type`.
```diff
@@ -1424,6 +1424,12 @@ def _to_numpy_type(type: DataType) -> Optional["np.dtype"]:
         return np.dtype("float32")
     elif type == DoubleType():
         return np.dtype("float64")
+    elif type == TimestampType():
```
Can we move `spark_type_to_pandas_dtype` in `pandas/typedef/typehints` here, and reuse it?
@HyukjinKwon It seems `spark_type_to_pandas_dtype` uses `datetime64[ns]` instead of `datetime64[us]`. This would still return the same error, since Spark only supports microsecond precision when converting from Arrow. We actually have `_to_corrected_pandas_type` in the same file to reuse, but it also uses nanoseconds and would fail in this case. Any suggestions on reusing this while also fixing the issue?
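As a quick illustration of the mismatch being discussed (a sketch, assuming `pyspark.pandas` is importable in the environment):

```python
# spark_type_to_pandas_dtype maps Spark's TimestampType to pandas'
# nanosecond-precision dtype, which Spark rejects when converting from Arrow.
from pyspark.pandas.typedef.typehints import spark_type_to_pandas_dtype
from pyspark.sql.types import TimestampType

print(spark_type_to_pandas_dtype(TimestampType()))  # datetime64[ns]
```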
Background: we are using pandas to convert Spark-typed data into an Arrow record batch.

spark/python/pyspark/sql/streaming/stateful_processor_api_client.py (lines 443 to 453 in b229044):

```python
def _send_arrow_state(self, schema: StructType, state: List[Tuple]) -> None:
    import pyarrow as pa
    import pandas as pd

    column_names = [field.name for field in schema.fields]
    pandas_df = convert_pandas_using_numpy_type(
        pd.DataFrame(state, columns=column_names), schema
    )
    batch = pa.RecordBatch.from_pandas(pandas_df)
    self.serializer.dump_stream(iter([batch]), self.sockfile)
    self.sockfile.flush()
```
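For readers following along, `convert_pandas_using_numpy_type` is the step where `_to_numpy_type` gets applied. A hedged sketch of roughly what it does; the actual implementation in pyspark may differ in details:

```python
import pandas as pd
from pyspark.sql.types import StructType


def convert_pandas_using_numpy_type(df: pd.DataFrame, schema: StructType) -> pd.DataFrame:
    # Cast each column whose Spark type has a known NumPy mapping, so that
    # pa.RecordBatch.from_pandas produces Arrow types Spark will accept.
    for field in schema.fields:
        dtype = _to_numpy_type(field.dataType)  # None for unmapped types
        if dtype is not None:
            df[field.name] = df[field.name].astype(dtype)
    return df
```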
👌
@jingz-db does it look good to you?
Merged to master.
Since this is a bugfix for a feature we are introducing first in Spark 4.0.0, do we think there is still a door open to fix this in Spark 4.0.0, or at least Spark 4.0.1? cc @cloud-fan as release manager of Spark 4.0.0.
Oh yeah, let's backport. I don't mind it.
What changes were proposed in this pull request?
Fix an issue where timestamp cannot be used in ListState when multiple state data is involved.
When transmitting multiple state data, we use Arrow to construct an Arrow record batch from a pandas DataFrame, but this requires proper type conversion to be compatible with Spark.
Timestamp is missing from this conversion util. Since timestamp precision in pandas is nanoseconds while the precision in Spark is microseconds, we need a proper conversion to make them compatible.
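To make the precision issue concrete, a small standalone sketch of the conversion boundary (the column name and values are illustrative only):

```python
import numpy as np
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"ts": [pd.Timestamp("2024-01-01 00:00:00.123456")]})
print(df["ts"].dtype)  # datetime64[ns] -- pandas' default precision

# Without a cast, from_pandas yields an Arrow timestamp[ns] column, which
# Spark rejects per this PR's description; casting to microseconds first
# (requires pandas >= 2.0) yields timestamp[us], which Spark accepts.
df["ts"] = df["ts"].astype(np.dtype("datetime64[us]"))
batch = pa.RecordBatch.from_pandas(df)
print(batch.schema.field("ts").type)  # timestamp[us]
```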
Why are the changes needed?
Without this change, using a timestamp type with ListState `put()` or `appendList()` will result in the error below.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
Added new test case.
Was this patch authored or co-authored using generative AI tooling?
No.