-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-51576][SQL] Prohibit ANSI cast from array<variant, containsNull = false>
to array<_, containsNull = false>
#50343
Conversation
case (ArrayType(fromType, fn), ArrayType(toType, tn)) => | ||
canAnsiCast(fromType, toType) && resolvableNullability(fn, tn) | ||
canAnsiCast(fromType, toType) && resolvableNullability(fn || (fromType == VariantType), tn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we want to do is to follow Cast.canCast
case (ArrayType(fromType, fn), ArrayType(toType, tn)) =>
canCast(fromType, toType) &&
resolvableNullability(fn || forceNullable(fromType, toType), tn)
fromType == VariantType
makes sense, but shall we consider the deeply nested cases like array of array of variant?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know if non-null array<variant>
can cast to null. Yeah, I think forceNullable is also complete in this case but only the variant case would apply in forceNullable because every other type that fails in resolveNullability would also fail in canAnsiCast. I'll change it to resolveNullability because it is less brittle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh forceNullable causes it to fail for some reason. Looking into it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We cannot do forceNullable because forceNullable(String, Boolean)
returns true and String to Boolean cast is allowed in ANSI - just that if it fails, the cast throws an error instead of giving nulls.
Variant is the only data type which can legally cast from non-null to null in ANSI.
I've changed it back to my original version. Let me know if you can think of any cases where it could cause problems.
d1045a9
to
df578a1
Compare
array<variant, containsNull = false>
to array<_, containsNull = false>
array<variant, containsNull = false>
to array<_, containsNull = false>
thanks, merging to master/4.0! |
…ll = false>` to `array<_, containsNull = false>` ### What changes were proposed in this pull request? This PR prohibits ANSI cast from `ArrayType(VariantType, containsNull = false)` to `ArrayType(_, containsNull = false)`, or `StructField(VariantType, nullable = false)` to `StructField(_, nullable = true)` or `MapType(_, VariantType, valueContainsNull = false)` to `MapType(_, _, valueContainsNull = false)`. ### Why are the changes needed? Variant is the only data type which can be ANSI cast from a non-null value to a null value. More specifically, variant nulls (generated using `parse_json('null')`) when cast to string result in `null`. However, `array<variant, containsNull = false>` can be cast to `array<string, containsNull = true>`. This should not be allowed since `array(parse_json('null'))` would give `array(null)` after this cast. Here is a demonstration of this problem where we create an `array<string, containsNull = true>` which in fact does not contain nulls: ``` >>> from pyspark.sql.functions import col >>> from pyspark.sql.types import StringType, VariantType, ArrayType >>> >>> df = spark.sql("select array(parse_json('null')) arr") >>> df.printSchema() root |-- arr: array (nullable = false) | |-- element: variant (containsNull = false) >>> df2 = df.select(col('arr').cast(ArrayType(StringType(), False))) >>> df2.selectExpr("arr[0] is null").show() +----------------+ |(arr[0] IS NULL)| +----------------+ | true| +----------------+ >>> df2.printSchema() root |-- arr: array (nullable = false) | |-- element: string (containsNull = false) ``` ### Does this PR introduce _any_ user-facing change? Yes, users can no longer ANSI cast from `array<variant, containsNull = false>` to `array<_, containsNull = false>`. However, it is unlikely that users were hardcoding `containsNull = false`. The same is the case for struct field and maps. ### How was this patch tested? Unit tests making sure that only `containsNull = false` cases fail. ### Was this patch authored or co-authored using generative AI tooling? No. Closes #50343 from harshmotw-db/harsh-motwani_data/variant_cast_fix. Authored-by: Harsh Motwani <[email protected]> Signed-off-by: Wenchen Fan <[email protected]> (cherry picked from commit 084143b) Signed-off-by: Wenchen Fan <[email protected]>
…ll = false>` to `array<_, containsNull = false>` ### What changes were proposed in this pull request? This PR prohibits ANSI cast from `ArrayType(VariantType, containsNull = false)` to `ArrayType(_, containsNull = false)`, or `StructField(VariantType, nullable = false)` to `StructField(_, nullable = true)` or `MapType(_, VariantType, valueContainsNull = false)` to `MapType(_, _, valueContainsNull = false)`. ### Why are the changes needed? Variant is the only data type which can be ANSI cast from a non-null value to a null value. More specifically, variant nulls (generated using `parse_json('null')`) when cast to string result in `null`. However, `array<variant, containsNull = false>` can be cast to `array<string, containsNull = true>`. This should not be allowed since `array(parse_json('null'))` would give `array(null)` after this cast. Here is a demonstration of this problem where we create an `array<string, containsNull = true>` which in fact does not contain nulls: ``` >>> from pyspark.sql.functions import col >>> from pyspark.sql.types import StringType, VariantType, ArrayType >>> >>> df = spark.sql("select array(parse_json('null')) arr") >>> df.printSchema() root |-- arr: array (nullable = false) | |-- element: variant (containsNull = false) >>> df2 = df.select(col('arr').cast(ArrayType(StringType(), False))) >>> df2.selectExpr("arr[0] is null").show() +----------------+ |(arr[0] IS NULL)| +----------------+ | true| +----------------+ >>> df2.printSchema() root |-- arr: array (nullable = false) | |-- element: string (containsNull = false) ``` ### Does this PR introduce _any_ user-facing change? Yes, users can no longer ANSI cast from `array<variant, containsNull = false>` to `array<_, containsNull = false>`. However, it is unlikely that users were hardcoding `containsNull = false`. The same is the case for struct field and maps. ### How was this patch tested? Unit tests making sure that only `containsNull = false` cases fail. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#50343 from harshmotw-db/harsh-motwani_data/variant_cast_fix. Authored-by: Harsh Motwani <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
This PR prohibits ANSI cast from
ArrayType(VariantType, containsNull = false)
toArrayType(_, containsNull = false)
, orStructField(VariantType, nullable = false)
toStructField(_, nullable = true)
orMapType(_, VariantType, valueContainsNull = false)
toMapType(_, _, valueContainsNull = false)
.Why are the changes needed?
Variant is the only data type which can be ANSI cast from a non-null value to a null value. More specifically, variant nulls (generated using
parse_json('null')
) when cast to string result innull
.However,
array<variant, containsNull = false>
can be cast toarray<string, containsNull = true>
. This should not be allowed sincearray(parse_json('null'))
would givearray(null)
after this cast.Here is a demonstration of this problem where we create an
array<string, containsNull = true>
which in fact does not contain nulls:Does this PR introduce any user-facing change?
Yes, users can no longer ANSI cast from
array<variant, containsNull = false>
toarray<_, containsNull = false>
. However, it is unlikely that users were hardcodingcontainsNull = false
. The same is the case for struct field and maps.How was this patch tested?
Unit tests making sure that only
containsNull = false
cases fail.Was this patch authored or co-authored using generative AI tooling?
No.