
[BUG]: is_unique checks with composite unique columns marks all-NULL values as not unique #344

Open
@ovidiu-eremia

Description

To validate uniqueness constraints over composite columns, DQX proposes using pyspark.sql.functions.struct to package the participating columns into a single struct whose fields are those columns:

https://github.com/databrickslabs/dqx/blob/022fdb5b520139b5dec9404bc771006c6ab16255/docs/dqx/docs/reference/quality_rules.mdx#is_unique-for-multiple-columns-composite-key

But this does not follow the SQL standard, which ignores rows that have NULL values in any of the columns that are part of the unique constraint.
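For reference, the standard's semantics can be sketched in plain Python, without Spark (`not_unique_rows` is a hypothetical helper for illustration): a row participates in the uniqueness check only when every unique column is non-NULL.

```python
from collections import Counter

def not_unique_rows(rows, key_cols):
    """Return indices of rows violating a SQL-standard unique constraint.

    Rows with a NULL (None) in any key column are exempt: per the SQL
    standard, only rows whose key columns are all non-NULL can collide.
    """
    keys = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        keys.append(None if any(v is None for v in key) else key)
    counts = Counter(k for k in keys if k is not None)
    return [i for i, k in enumerate(keys) if k is not None and counts[k] > 1]

rows = [
    {"b": "", "c": None},    # NULL in key -> exempt
    {"b": "", "c": None},    # NULL in key -> exempt
    {"b": "str1", "c": 1},   # duplicate non-NULL key -> flagged
    {"b": "str1", "c": 1},   # duplicate non-NULL key -> flagged
    {"b": None, "c": None},  # all-NULL key -> exempt
    {"b": None, "c": None},  # all-NULL key -> exempt
]
print(not_unique_rows(rows, ["b", "c"]))  # -> [2, 3]
```

Only the two 'str1'/1 rows violate the constraint under these semantics; the current DQX struct-based check would additionally flag the NULL-containing rows.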

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.testing.utils import assertDataFrameEqual

# Import path may differ between DQX versions.
from databricks.labs.dqx.check_funcs import is_unique


def test_col_is_unique_handle_nulls_in_composite_cols(spark: SparkSession) -> None:
    """Show that is_unique with a composite struct doesn't ignore NULL inner fields.

    The SQL standard states that:
    'A unique constraint on T is satisfied if and only if there do not exist two rows R1
    and R2 of T such that R1 and R2 have the same non-NULL values in the unique cols.'

    Applied to the particular case of two all-NULL rows this means:

    # given the table
    T = [Row(a=None, b=None, c=None),
         Row(a=None, b=None, c=None)]

    # when we run is_unique with the composite struct of the unique columns
    is_unique(struct('b', 'c'))

    # then the unique constraint is reported as not satisfied
    Value '{null, null}' in Column 'struct(b, c)' is not unique
    Value '{null, null}' in Column 'struct(b, c)' is not unique

    So this test shows that the current implementation flags two all-NULL rows
    as duplicates, even though the SQL standard would accept them.

    Args:
        spark (SparkSession): The Spark session needed for context.
    """
    test_df = spark.createDataFrame(
        data=[
            [1, "", None],
            [2, "", None],
            [3, "str1", 1],
            [4, "str1", 1],
            [None, None, None],
            [None, None, None],
        ],
        schema="a: int, b: string, c: int",
    )

    actual = test_df.select(is_unique(f.col("a")), is_unique(f.struct("b", "c")))

    checked_schema = "a_is_not_unique: string, struct_b_c_is_not_unique: string"
    expected = spark.createDataFrame(
        [
            [None, "Value '{, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{str1, 1}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{str1, 1}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{null, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{null, null}' in Column 'struct(b, c)' is not unique"],
        ],
        checked_schema,
    )
    assertDataFrameEqual(actual, expected)

Expected Behavior

def test_col_is_unique_handle_nulls_in_composite_cols(spark: SparkSession) -> None:
    """Rows with NULL in any of the unique columns should be ignored."""
    test_df = spark.createDataFrame(
        data=[
            [1, "", None],
            [2, "", None],
            [3, "str1", 1],
            [4, "str1", 1],
            [None, None, None],
            [None, None, None],
        ],
        schema="a: int, b: string, c: int",
    )

    actual = test_df.select(is_unique(f.col("a")), is_unique("b", "c"))

    checked_schema = "a_is_not_unique: string, b_c_is_not_unique: string"
    expected = spark.createDataFrame(
        [
            [None, None],
            [None, None],
            [None, "Value 'str1, 1' in Column 'b, c' is not unique"],
            [None, "Value 'str1, 1' in Column 'b, c' is not unique"],
            [None, None],
            [None, None],
        ],
        checked_schema,
    )
    assertDataFrameEqual(actual, expected)

Steps To Reproduce

  1. Add this test to the test suite
  2. Run the test suite.

Cloud

AWS

Operating System

Linux

Relevant log output

Labels

bug