Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-51576][SQL] Prohibit ANSI cast from array<variant, containsNull = false> to array<_, containsNull = false> #50343

Conversation

harshmotw-db
Copy link
Contributor

@harshmotw-db harshmotw-db commented Mar 20, 2025

What changes were proposed in this pull request?

This PR prohibits ANSI cast from ArrayType(VariantType, containsNull = false) to ArrayType(_, containsNull = false), or StructField(VariantType, nullable = false) to StructField(_, nullable = true) or MapType(_, VariantType, valueContainsNull = false) to MapType(_, _, valueContainsNull = false).

Why are the changes needed?

Variant is the only data type which can be ANSI cast from a non-null value to a null value. More specifically, variant nulls (generated using parse_json('null')) when cast to string result in null.
However, array<variant, containsNull = false> can be cast to array<string, containsNull = true>. This should not be allowed since array(parse_json('null')) would give array(null) after this cast.
Here is a demonstration of this problem where we create an array<string, containsNull = true> which in fact does not contain nulls:

>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import StringType, VariantType, ArrayType
>>> 
>>> df = spark.sql("select array(parse_json('null')) arr")
>>> df.printSchema()
root
 |-- arr: array (nullable = false)
 |    |-- element: variant (containsNull = false)

>>> df2 = df.select(col('arr').cast(ArrayType(StringType(), False)))
>>> df2.selectExpr("arr[0] is null").show()
+----------------+
|(arr[0] IS NULL)|
+----------------+
|            true|
+----------------+

>>> df2.printSchema()
root
 |-- arr: array (nullable = false)
 |    |-- element: string (containsNull = false)

Does this PR introduce any user-facing change?

Yes, users can no longer ANSI cast from array<variant, containsNull = false> to array<_, containsNull = false>. However, it is unlikely that users were hardcoding containsNull = false. The same is the case for struct field and maps.

How was this patch tested?

Unit tests making sure that only containsNull = false cases fail.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the SQL label Mar 20, 2025
case (ArrayType(fromType, fn), ArrayType(toType, tn)) =>
canAnsiCast(fromType, toType) && resolvableNullability(fn, tn)
canAnsiCast(fromType, toType) && resolvableNullability(fn || (fromType == VariantType), tn)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think what we want to do is to follow Cast.canCast

    case (ArrayType(fromType, fn), ArrayType(toType, tn)) =>
      canCast(fromType, toType) &&
        resolvableNullability(fn || forceNullable(fromType, toType), tn)

fromType == VariantType makes sense, but shall we consider the deeply nested cases like array of array of variant?

Copy link
Contributor Author

@harshmotw-db harshmotw-db Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know if non-null array<variant> can cast to null. Yeah, I think forceNullable is also complete in this case but only the variant case would apply in forceNullable because every other type that fails in resolveNullability would also fail in canAnsiCast. I'll change it to resolveNullability because it is less brittle.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh forceNullable causes it to fail for some reason. Looking into it.

Copy link
Contributor Author

@harshmotw-db harshmotw-db Mar 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We cannot do forceNullable because forceNullable(String, Boolean) returns true and String to Boolean cast is allowed in ANSI - just that if it fails, the cast throws an error instead of giving nulls.
Variant is the only data type which can legally cast from non-null to null in ANSI.

I've changed it back to my original version. Let me know if you can think of any cases where it could cause problems.

@harshmotw-db harshmotw-db force-pushed the harsh-motwani_data/variant_cast_fix branch from d1045a9 to df578a1 Compare March 21, 2025 17:43
@HyukjinKwon HyukjinKwon changed the title [SPARK-51576] Prohibit ANSI cast from array<variant, containsNull = false> to array<_, containsNull = false> [SPARK-51576][SQL] Prohibit ANSI cast from array<variant, containsNull = false> to array<_, containsNull = false> Mar 24, 2025
@cloud-fan
Copy link
Contributor

thanks, merging to master/4.0!

@cloud-fan cloud-fan closed this in 084143b Mar 24, 2025
cloud-fan pushed a commit that referenced this pull request Mar 24, 2025
…ll = false>` to `array<_, containsNull = false>`

### What changes were proposed in this pull request?

This PR prohibits ANSI cast from `ArrayType(VariantType, containsNull = false)` to `ArrayType(_, containsNull = false)`, or `StructField(VariantType, nullable = false)` to `StructField(_, nullable = true)` or `MapType(_, VariantType, valueContainsNull = false)` to `MapType(_, _, valueContainsNull = false)`.

### Why are the changes needed?

Variant is the only data type which can be ANSI cast from a non-null value to a null value. More specifically, variant nulls (generated using `parse_json('null')`) when cast to string result in `null`.
However, `array<variant, containsNull = false>` can be cast to `array<string, containsNull = true>`. This should not be allowed since `array(parse_json('null'))` would give `array(null)` after this cast.
Here is a demonstration of this problem where we create an `array<string, containsNull = true>` which in fact does not contain nulls:
```
>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import StringType, VariantType, ArrayType
>>> 
>>> df = spark.sql("select array(parse_json('null')) arr")
>>> df.printSchema()
root
 |-- arr: array (nullable = false)
 |    |-- element: variant (containsNull = false)

>>> df2 = df.select(col('arr').cast(ArrayType(StringType(), False)))
>>> df2.selectExpr("arr[0] is null").show()
+----------------+
|(arr[0] IS NULL)|
+----------------+
|            true|
+----------------+

>>> df2.printSchema()
root
 |-- arr: array (nullable = false)
 |    |-- element: string (containsNull = false)
```

### Does this PR introduce _any_ user-facing change?

Yes, users can no longer ANSI cast from `array<variant, containsNull = false>` to `array<_, containsNull = false>`. However, it is unlikely that users were hardcoding `containsNull = false`. The same is the case for struct field and maps.

### How was this patch tested?

Unit tests making sure that only `containsNull = false` cases fail.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #50343 from harshmotw-db/harsh-motwani_data/variant_cast_fix.

Authored-by: Harsh Motwani <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 084143b)
Signed-off-by: Wenchen Fan <[email protected]>
SauronShepherd pushed a commit to SauronShepherd/spark that referenced this pull request Mar 25, 2025
…ll = false>` to `array<_, containsNull = false>`

### What changes were proposed in this pull request?

This PR prohibits ANSI cast from `ArrayType(VariantType, containsNull = false)` to `ArrayType(_, containsNull = false)`, or `StructField(VariantType, nullable = false)` to `StructField(_, nullable = true)` or `MapType(_, VariantType, valueContainsNull = false)` to `MapType(_, _, valueContainsNull = false)`.

### Why are the changes needed?

Variant is the only data type which can be ANSI cast from a non-null value to a null value. More specifically, variant nulls (generated using `parse_json('null')`) when cast to string result in `null`.
However, `array<variant, containsNull = false>` can be cast to `array<string, containsNull = true>`. This should not be allowed since `array(parse_json('null'))` would give `array(null)` after this cast.
Here is a demonstration of this problem where we create an `array<string, containsNull = true>` which in fact does not contain nulls:
```
>>> from pyspark.sql.functions import col
>>> from pyspark.sql.types import StringType, VariantType, ArrayType
>>> 
>>> df = spark.sql("select array(parse_json('null')) arr")
>>> df.printSchema()
root
 |-- arr: array (nullable = false)
 |    |-- element: variant (containsNull = false)

>>> df2 = df.select(col('arr').cast(ArrayType(StringType(), False)))
>>> df2.selectExpr("arr[0] is null").show()
+----------------+
|(arr[0] IS NULL)|
+----------------+
|            true|
+----------------+

>>> df2.printSchema()
root
 |-- arr: array (nullable = false)
 |    |-- element: string (containsNull = false)
```

### Does this PR introduce _any_ user-facing change?

Yes, users can no longer ANSI cast from `array<variant, containsNull = false>` to `array<_, containsNull = false>`. However, it is unlikely that users were hardcoding `containsNull = false`. The same is the case for struct field and maps.

### How was this patch tested?

Unit tests making sure that only `containsNull = false` cases fail.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#50343 from harshmotw-db/harsh-motwani_data/variant_cast_fix.

Authored-by: Harsh Motwani <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants