
[BUG]: is_unique checks with composite unique columns marks all-NULL values as not unique #344

Open
@ovidiu-eremia

Description

To validate uniqueness constraints over composite columns, DQX proposes using pyspark.sql.functions.struct to package the participating columns into a single struct whose fields are those columns:

https://github.com/databrickslabs/dqx/blob/022fdb5b520139b5dec9404bc771006c6ab16255/docs/dqx/docs/reference/quality_rules.mdx#is_unique-for-multiple-columns-composite-key

But this does not follow the SQL standard, which ignores rows that have NULL values in any of the columns that are part of the unique constraint.
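For reference, the standard's semantics can be sketched in plain Python, without Spark (`not_unique_rows` is a hypothetical helper for illustration): a row participates in the uniqueness check only when every unique column is non-NULL.

```python
from collections import Counter

def not_unique_rows(rows, key_cols):
    """Return indices of rows violating a SQL-standard unique constraint.

    Rows with a NULL (None) in any key column are exempt: per the SQL
    standard, only rows whose key columns are all non-NULL can collide.
    """
    keys = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        keys.append(None if any(v is None for v in key) else key)
    counts = Counter(k for k in keys if k is not None)
    return [i for i, k in enumerate(keys) if k is not None and counts[k] > 1]

rows = [
    {"b": "", "c": None},    # NULL in key -> exempt
    {"b": "", "c": None},    # NULL in key -> exempt
    {"b": "str1", "c": 1},   # duplicate non-NULL key -> flagged
    {"b": "str1", "c": 1},   # duplicate non-NULL key -> flagged
    {"b": None, "c": None},  # all-NULL key -> exempt
    {"b": None, "c": None},  # all-NULL key -> exempt
]
print(not_unique_rows(rows, ["b", "c"]))  # -> [2, 3]
```

Only the two 'str1'/1 rows violate the constraint under these semantics; the current DQX struct-based check would additionally flag the NULL-containing rows.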

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

import pyspark.sql.functions as f
from pyspark.sql import SparkSession
from pyspark.testing.utils import assertDataFrameEqual

# Import path may differ between DQX versions.
from databricks.labs.dqx.check_funcs import is_unique


def test_col_is_unique_handle_nulls_in_composite_cols(spark: SparkSession) -> None:
    """Show that is_unique with a composite struct doesn't ignore NULL inner fields.

    The SQL standard states that:
    'A unique constraint on T is satisfied if and only if there do not exist two rows R1
    and R2 of T such that R1 and R2 have the same non-NULL values in the unique cols.'

    Applied to the particular case of two all-NULL rows this means:

    # given the table
    T = [Row(a=None, b=None, c=None),
         Row(a=None, b=None, c=None)]

    # when we run is_unique with the composite struct of the unique columns
    is_unique(struct('b', 'c'))

    # then the unique constraint is reported as not satisfied
    Value '{null, null}' in Column 'struct(b, c)' is not unique
    Value '{null, null}' in Column 'struct(b, c)' is not unique

    So this test shows that the current implementation flags two all-NULL rows
    as duplicates, even though the SQL standard would accept them.

    Args:
        spark (SparkSession): The Spark session needed for context.
    """
    test_df = spark.createDataFrame(
        data=[
            [1, "", None],
            [2, "", None],
            [3, "str1", 1],
            [4, "str1", 1],
            [None, None, None],
            [None, None, None],
        ],
        schema="a: int, b: string, c: int",
    )

    actual = test_df.select(is_unique(f.col("a")), is_unique(f.struct("b", "c")))

    checked_schema = "a_is_not_unique: string, struct_b_c_is_not_unique: string"
    expected = spark.createDataFrame(
        [
            [None, "Value '{, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{str1, 1}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{str1, 1}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{null, null}' in Column 'struct(b, c)' is not unique"],
            [None, "Value '{null, null}' in Column 'struct(b, c)' is not unique"],
        ],
        checked_schema,
    )
    assertDataFrameEqual(actual, expected)

Expected Behavior

def test_col_is_unique_handle_nulls_in_composite_cols(spark: SparkSession) -> None:
    """Rows with NULL in any of the unique columns should be ignored."""
    test_df = spark.createDataFrame(
        data=[
            [1, "", None],
            [2, "", None],
            [3, "str1", 1],
            [4, "str1", 1],
            [None, None, None],
            [None, None, None],
        ],
        schema="a: int, b: string, c: int",
    )

    actual = test_df.select(is_unique(f.col("a")), is_unique("b", "c"))

    checked_schema = "a_is_not_unique: string, b_c_is_not_unique: string"
    expected = spark.createDataFrame(
        [
            [None, None],
            [None, None],
            [None, "Value 'str1, 1' in Column 'b, c' is not unique"],
            [None, "Value 'str1, 1' in Column 'b, c' is not unique"],
            [None, None],
            [None, None],
        ],
        checked_schema,
    )
    assertDataFrameEqual(actual, expected)

Steps To Reproduce

  1. Add this test to the test suite
  2. Run the test suite.

Cloud

AWS

Operating System

Linux

Relevant log output

Labels

bug