Changes from 2 commits
4 changes: 2 additions & 2 deletions docs/dqx/docs/reference/quality_checks.mdx
@@ -28,8 +28,8 @@ You can also define your own custom checks (see [Creating custom checks](#creati
| `is_not_null` | Checks whether the values in the input column are not null. | `column`: column to check (can be a string column name or a column expression) |
| `is_not_empty` | Checks whether the values in the input column are not empty (but may be null). | `column`: column to check (can be a string column name or a column expression) |
| `is_not_null_and_not_empty` | Checks whether the values in the input column are not null and not empty. | `column`: column to check (can be a string column name or a column expression); `trim_strings`: optional boolean flag to trim spaces from strings |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
Contributor

Please follow the format of the other checks: add the new parameter to the Arguments column instead of describing it in the Description.

| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
Contributor

Please follow the format of the other checks: add the new parameter to the Arguments column instead of describing it in the Description.

Copilot AI Oct 31, 2025

The parameter documentation in the third column doesn't mention the new `case_sensitive` parameter. The parameters column should be updated to include: `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True).

Suggested change
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values; `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True) |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values; `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True) |

| `is_not_null_and_not_empty_array` | Checks whether the values in the array input column are not null and not empty. | `column`: column to check (can be a string column name or a column expression) |
| `is_in_range` | Checks whether the values in the input column are in the provided range (inclusive of both boundaries). | `column`: column to check (can be a string column name or a column expression); `min_limit`: min limit as number, date, timestamp, column name or sql expression; `max_limit`: max limit as number, date, timestamp, column name or sql expression |
| `is_not_in_range` | Checks whether the values in the input column are outside the provided range (inclusive of both boundaries). | `column`: column to check (can be a string column name or a column expression); `min_limit`: min limit as number, date, timestamp, column name or sql expression; `max_limit`: max limit as number, date, timestamp, column name or sql expression |
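
For reference, a metadata-style definition that passes the new argument might look like the sketch below. The dictionary layout (`criticality`, `check`, `function`, `arguments`) is assumed from DQX's metadata check format, `case_sensitive` is the parameter this PR proposes, and the column name and allowed values are made up for illustration.

```python
# Hypothetical metadata definition of the check. The key layout is assumed
# from DQX's metadata format; "case_sensitive" is the argument added in this PR.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_in_list",
            "arguments": {
                "column": "country_code",        # made-up column name
                "allowed": ["US", "DE", "PL"],   # made-up allowed values
                "case_sensitive": False,         # omit to keep the default (True)
            },
        },
    }
]

arguments = checks[0]["check"]["arguments"]
print(sorted(arguments))
```

Listing `case_sensitive` next to `column` and `allowed` here matches the reviewers' request to document it in the Arguments column rather than in the Description.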
46 changes: 36 additions & 10 deletions src/databricks/labs/dqx/check_funcs.py
@@ -129,12 +129,13 @@ def is_not_null(column: str | Column) -> Column:


@register_rule("row")
def is_not_null_and_is_in_list(column: str | Column, allowed: list) -> Column:
def is_not_null_and_is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column:
"""Checks whether the values in the input column are not null and present in the list of allowed values.

Can optionally perform a case-insensitive comparison.
Args:
column: column to check; can be a string column name or a column expression
allowed: list of allowed values (actual values or Column objects)
case_sensitive: whether to perform a case-sensitive comparison (default: True)

Returns:
Column object for condition
@@ -152,31 +153,44 @@ def is_not_null_and_is_in_list(column: str | Column, allowed: list) -> Column:
if not allowed:
raise InvalidParameterError("allowed list must not be empty.")

allowed_cols = [item if isinstance(item, Column) else F.lit(item) for item in allowed]
# Keep original values for display in error message
allowed_cols_display = [item if isinstance(item, Column) else F.lit(item) for item in allowed]

col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
condition = col_expr.isNull() | ~col_expr.isin(*allowed_cols)
Contributor

Can we simplify this, please?
I think we can keep most of the code intact and just apply `lower` where needed.

For example:

col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr
allowed_cols_compare = [F.lower(c) for c in allowed_cols_display] if not case_sensitive else allowed_cols_display

condition = col_expr.isNull() | (~col_expr_compare.isin(*allowed_cols_compare))
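
To make the suggestion concrete, here is a plain-Python model of the same condition. It is not the PySpark code itself: it assumes string values only, `fails_check` is a hypothetical name, and `None` stands in for SQL NULL.

```python
def fails_check(value, allowed, case_sensitive=True):
    """Plain-Python model of is_not_null_and_is_in_list: True means the row fails."""
    if not allowed:
        raise ValueError("allowed list must not be empty.")
    if value is None:
        # Mirrors `col_expr.isNull() | ...`: a null always fails this check.
        return True
    # Apply lower() only when a case-insensitive comparison was requested.
    value_compare = value if case_sensitive else value.lower()
    allowed_compare = allowed if case_sensitive else [a.lower() for a in allowed]
    return value_compare not in allowed_compare


print(fails_check("str1", ["STR1"]))                        # case-sensitive: fails
print(fails_check("str1", ["STR1"], case_sensitive=False))  # case-insensitive: passes
print(fails_check(None, ["STR1"], case_sensitive=False))    # null: fails
```

The two conditional expressions play the role of `col_expr_compare` and `allowed_cols_compare` in the suggestion, so the default path leaves both sides untouched.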


# Create columns for comparison
allowed_cols_compare = allowed_cols_display[:]
col_expr_compare = col_expr

# Case-insensitive normalization
if not case_sensitive:
col_expr_compare = F.lower(col_expr)
allowed_cols_compare = [F.lower(c) for c in allowed_cols_display]

condition = col_expr.isNull() | ~col_expr_compare.isin(*allowed_cols_compare)
return make_condition(
condition,
F.concat_ws(
"",
F.lit("Value '"),
F.when(col_expr.isNull(), F.lit("null")).otherwise(col_expr.cast("string")),
F.lit(f"' in Column '{col_expr_str}' is null or not in the allowed list: ["),
F.concat_ws(", ", *allowed_cols),
F.concat_ws(", ", *allowed_cols_display),
F.lit("]"),
),
f"{col_str_norm}_is_null_or_is_not_in_the_list",
)


@register_rule("row")
def is_in_list(column: str | Column, allowed: list) -> Column:
def is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column:
"""Checks whether the values in the input column are present in the list of allowed values
(null values are allowed).
(null values are allowed). Can optionally perform a case-insensitive comparison.
Contributor

Maybe note the limitations for MapType and StructType here in the docstring.


Args:
column: column to check; can be a string column name or a column expression
allowed: list of allowed values (actual values or Column objects)
case_sensitive: whether to perform a case-sensitive comparison (default: True)

Returns:
Column object for condition
@@ -194,17 +208,29 @@ def is_in_list(column: str | Column, allowed: list) -> Column:
if not allowed:
raise InvalidParameterError("allowed list must not be empty.")

allowed_cols = [item if isinstance(item, Column) else F.lit(item) for item in allowed]
# Keep original values for display in error message
allowed_cols_display = [item if isinstance(item, Column) else F.lit(item) for item in allowed]
Contributor
@mwojtyczka Oct 31, 2025

Same as above, please simplify.


col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
condition = ~col_expr.isin(*allowed_cols)

# Create columns for comparison
allowed_cols_compare = allowed_cols_display[:]
col_expr_compare = col_expr

# Case-insensitive normalization
if not case_sensitive:
col_expr_compare = F.lower(col_expr)
allowed_cols_compare = [F.lower(c) for c in allowed_cols_display]

condition = ~col_expr_compare.isin(*allowed_cols_compare)
return make_condition(
condition,
F.concat_ws(
"",
F.lit("Value '"),
F.when(col_expr.isNull(), F.lit("null")).otherwise(col_expr.cast("string")),
F.lit(f"' in Column '{col_expr_str}' is not in the allowed list: ["),
F.concat_ws(", ", *allowed_cols),
F.concat_ws(", ", *allowed_cols_display),
F.lit("]"),
),
f"{col_str_norm}_is_not_in_the_list",
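
The hunk above compares lower-cased values but builds the message from the original `allowed` items, which is why the updated tests expect `[STR1]` in the message even though the comparison ignores case. A plain-Python sketch of that split follows. It assumes string values, uses `None` for NULL, and `build_message` is a hypothetical helper rather than anything in DQX.

```python
def build_message(value, column_name, allowed, case_sensitive=True):
    """Return the failure message for is_in_list, or None when the value passes."""
    if value is None:
        return None  # is_in_list lets nulls through
    value_compare = value if case_sensitive else value.lower()
    allowed_compare = allowed if case_sensitive else [a.lower() for a in allowed]
    if value_compare in allowed_compare:
        return None
    # The message is built from the original items, not the lower-cased copies.
    allowed_display = ", ".join(str(a) for a in allowed)
    return (
        f"Value '{value}' in Column '{column_name}' "
        f"is not in the allowed list: [{allowed_display}]"
    )


print(build_message("str2", "a", ["STR1"], case_sensitive=False))
```

Keeping the display list separate means users see the values exactly as they configured them, regardless of how the comparison was normalized.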
30 changes: 15 additions & 15 deletions tests/integration/test_row_checks.py
@@ -166,10 +166,10 @@ def test_col_is_not_null_and_is_in_list(spark):
)

actual = test_df.select(
is_not_null_and_is_in_list("a", ["str1"]),
Contributor

Please don't remove existing tests. Add additional tests instead.
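
One way to address this is to leave the original expectations in place and append the case-insensitive ones next to them. The sketch below shows that layout in plain Python. `in_list` is a hypothetical stand-in for the real check, used only so the example runs without Spark, and the cases are illustrative rather than copied from the test file.

```python
def in_list(value, allowed, case_sensitive=True):
    """Hypothetical stand-in for the check: True when a non-null value is allowed."""
    if value is None:
        return False
    if case_sensitive:
        return value in allowed
    return value.lower() in [a.lower() for a in allowed]


# Existing expectations stay exactly as they were (default, case-sensitive).
existing_cases = [
    ("str1", ["str1"], {}, True),
    ("str2", ["str1"], {}, False),
]
# New expectations are appended rather than replacing the old ones.
new_cases = [
    ("str1", ["STR1"], {"case_sensitive": False}, True),
    ("str1", ["STR1"], {"case_sensitive": True}, False),
]

for value, allowed, kwargs, expected in existing_cases + new_cases:
    assert in_list(value, allowed, **kwargs) is expected
```

Because the existing cases call the check without `case_sensitive`, they also guard the default behavior against regressions.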

is_not_null_and_is_in_list("b", [F.lit(3)]),
is_not_null_and_is_in_list(F.col("c").getItem("val"), [F.lit("a")]),
is_not_null_and_is_in_list(F.try_element_at("d", F.lit(2)), ["b"]),
is_not_null_and_is_in_list("a", ["STR1"], case_sensitive=False),
is_not_null_and_is_in_list("b", [F.lit(3)], case_sensitive=True),
is_not_null_and_is_in_list(F.col("c").getItem("val"), [F.lit("A")], case_sensitive=False),
is_not_null_and_is_in_list(F.try_element_at("d", F.lit(2)), ["b"], case_sensitive=True),
)

checked_schema = (
@@ -182,15 +182,15 @@ def test_col_is_not_null_and_is_in_list(spark):
[
[None, "Value '1' in Column 'b' is null or not in the allowed list: [3]", None, None],
[
"Value 'str2' in Column 'a' is null or not in the allowed list: [str1]",
Contributor

Same as above: don't remove existing tests.

"Value 'str2' in Column 'a' is null or not in the allowed list: [STR1]",
"Value 'null' in Column 'b' is null or not in the allowed list: [3]",
"Value 'str2' in Column 'UnresolvedExtractValue(c, val)' is null or not in the allowed list: [a]",
"Value 'str2' in Column 'UnresolvedExtractValue(c, val)' is null or not in the allowed list: [A]",
"Value 'a' in Column 'try_element_at(d, 2)' is null or not in the allowed list: [b]",
],
[
"Value ' ' in Column 'a' is null or not in the allowed list: [str1]",
"Value ' ' in Column 'a' is null or not in the allowed list: [STR1]",
None,
"Value ' ' in Column 'UnresolvedExtractValue(c, val)' is null or not in the allowed list: [a]",
"Value ' ' in Column 'UnresolvedExtractValue(c, val)' is null or not in the allowed list: [A]",
"Value ' ' in Column 'try_element_at(d, 2)' is null or not in the allowed list: [b]",
],
],
@@ -212,10 +212,10 @@ def test_col_is_not_in_list(spark):
)

actual = test_df.select(
is_in_list("a", ["str1"]),
Contributor

Same as above: don't remove existing tests.

is_in_list("b", [F.lit(3)]),
is_in_list(F.col("c").getItem("val"), [F.lit("a")]),
is_in_list(F.try_element_at("d", F.lit(2)), ["b"]),
is_in_list("a", ["STR1"], case_sensitive=False),
is_in_list("b", [F.lit(3)], case_sensitive=True),
is_in_list(F.col("c").getItem("val"), [F.lit("A")], case_sensitive=False),
is_in_list(F.try_element_at("d", F.lit(2)), ["b"], case_sensitive=True),
)

checked_schema = (
@@ -228,13 +228,13 @@ def test_col_is_not_in_list(spark):
[
[None, "Value '1' in Column 'b' is not in the allowed list: [3]", None, None],
[
"Value 'str2' in Column 'a' is not in the allowed list: [str1]",
Contributor

Same as above: don't remove existing tests.

"Value 'str2' in Column 'a' is not in the allowed list: [STR1]",
None,
"Value 'str2' in Column 'UnresolvedExtractValue(c, val)' is not in the allowed list: [a]",
"Value 'str2' in Column 'UnresolvedExtractValue(c, val)' is not in the allowed list: [A]",
"Value 'a' in Column 'try_element_at(d, 2)' is not in the allowed list: [b]",
],
[
"Value ' ' in Column 'a' is not in the allowed list: [str1]",
"Value ' ' in Column 'a' is not in the allowed list: [STR1]",
None,
None,
"Value 'a' in Column 'try_element_at(d, 2)' is not in the allowed list: [b]",