Description
For validating uniqueness constraints on composite columns, DQX proposes using pyspark.sql.functions.struct to package the participating columns into a single struct column whose fields are those columns.
However, this does not follow the SQL standard, which ignores rows that contain NULL values in any of the columns that are part of the unique constraint.
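A minimal standalone PySpark sketch (not DQX code; the column names b and c simply mirror the test below) of why the struct wrapper breaks those semantics: a struct whose fields are all NULL is itself a non-NULL value, so two all-NULL rows share the same key and are counted as duplicates.

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(None, None), (None, None)], "b: string, c: int")

# The struct of two NULL fields is itself NOT NULL ...
print(df.select(f.struct("b", "c").isNull().alias("key_is_null")).collect())
# [Row(key_is_null=False), Row(key_is_null=False)]

# ... so grouping on it collapses both all-NULL rows into one key with count 2,
# which a uniqueness check then reports as a duplicate.
print(df.groupBy(f.struct("b", "c").alias("key")).count().collect())
# a single group with count == 2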
Is there an existing issue for this?
- I have searched the existing issues
Current Behavior
# Imports assumed from the existing DQX test suite; the exact module that
# provides `is_unique` depends on the DQX version in use.
import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.testing import assertDataFrameEqual


def test_col_is_unique_handle_nulls_in_composite_cols(spark: SparkSession) -> None:
    """Test shows that is_unique with a composite struct doesn't ignore NULL inner fields.

    The SQL standard states that:
    'A unique constraint on T is satisfied if and only if there do not exist two rows R1
    and R2 of T such that R1 and R2 have the same non-NULL values in the unique cols.'

    Applying this statement to the particular case of two rows with NULL values means:

    # given the table
    T = [Row(a=None, b=None, c=None),
         Row(a=None, b=None, c=None)]

    # when we run is_unique with the composite struct of the unique columns
    is_unique(struct('b', 'c'))

    # then we get that the unique constraint is not satisfied
    Value '{null, null}' in Column 'struct(b, c)' is not unique
    Value '{null, null}' in Column 'struct(b, c)' is not unique

    So this test shows that we can't have two rows with only NULL values.

    Args:
        spark (SparkSession): The Spark session needed for context
    """
    test_df = spark.createDataFrame(
        data=[
            [1, "", None],
            [2, "", None],
            [3, "str1", 1],
            [4, "str1", 1],
            [None, None, None],
            [None, None, None],
        ],
        schema="a: int, b: string, c: int",
    )

    actual = test_df.select(is_unique(f.col("a")), is_unique(f.struct("b", "c")))

    checked_schema = "a_is_not_unique: string, struct_b_c_is_not_unique: string"
    expected = spark.createDataFrame(
        [
            [None, "Value '{, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{str1, 1}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{str1, 1}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{null, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{null, null}' in Column 'struct(b, c)' is not unique"],
        ],
        checked_schema,
    )
    assertDataFrameEqual(actual, expected)
Expected Behavior
def test_col_is_unique_handle_nulls_in_composite_cols(spark: SparkSession) -> None:
    test_df = spark.createDataFrame(
        data=[
            [1, "", None],
            [2, "", None],
            [3, "str1", 1],
            [4, "str1", 1],
            [None, None, None],
            [None, None, None],
        ],
        schema="a: int, b: string, c: int",
    )

    actual = test_df.select(is_unique(f.col("a")), is_unique("b", "c"))

    checked_schema = "a_is_not_unique: string, b_c_is_not_unique: string"
    expected = spark.createDataFrame(
        [
            [None, None],
            [None, None],
            [None, "Value 'str1, 1' in Column 'b, c' is not unique"],
            [None, "Value 'str1, 1' in Column 'b, c' is not unique"],
            [None, None],
            [None, None],
        ],
        checked_schema,
    )
    assertDataFrameEqual(actual, expected)
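For reference, a plain PySpark sketch (not the DQX implementation; the derived message and column names are illustrative) of the SQL-standard semantics the expected test encodes: a row only participates in the composite uniqueness check when every column of the constraint is non-NULL.

import pyspark.sql.functions as f
from pyspark.sql import SparkSession, Window

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (1, "", None),
        (2, "", None),
        (3, "str1", 1),
        (4, "str1", 1),
        (None, None, None),
        (None, None, None),
    ],
    "a: int, b: string, c: int",
)

# Per the SQL standard, a row takes part in the uniqueness check only if
# every column of the unique constraint is non-NULL.
all_non_null = f.col("b").isNotNull() & f.col("c").isNotNull()

# Count occurrences of each (b, c) key, restricted to the qualifying rows.
w = Window.partitionBy("b", "c")
qualifying_count = f.count(f.when(all_non_null, f.lit(1))).over(w)

result = df.withColumn(
    "b_c_is_not_unique",
    f.when(all_non_null & (qualifying_count > 1), f.concat_ws(", ", "b", "c")),
)
# Only the two ("str1", 1) rows get a non-NULL b_c_is_not_unique value;
# the ("", NULL) and (NULL, NULL) pairs are ignored, matching the expected test above.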
Steps To Reproduce
- Add the Current Behavior test above to the test suite.
- Run the test suite.
Cloud
AWS
Operating System
Linux