-
Notifications
You must be signed in to change notification settings - Fork 69
Added case-insensitive comparison support to is_in_list and is_not_null_and_is_in_list #673
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…ll_and_is_in_list
|
❌ 403/404 passed, 1 flaky, 1 failed, 39 skipped, 4h33m43s total ❌ test_e2e_workflow_for_patterns: databricks.sdk.errors.platform.Unknown: run_quality_checker: Please refer to the logs for this run on the triggered run details page. (7m50.108s)Flaky tests:
Running from acceptance #3084 |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #673 +/- ##
===========================================
- Coverage 89.77% 54.14% -35.64%
===========================================
Files 56 60 +4
Lines 4966 5181 +215
===========================================
- Hits 4458 2805 -1653
- Misses 508 2376 +1868 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
| | `is_not_null_and_not_empty` | Checks whether the values in the input column are not null and not empty. | `column`: column to check (can be a string column name or a column expression); `trim_strings`: optional boolean flag to trim spaces from strings | | ||
| | `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | | ||
| | `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | | ||
| | `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow the format of other check. Add the new param to the Arguments column, instead of describing it in the Description
| | `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | | ||
| | `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | | ||
| | `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | | ||
| | `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please follow the format of other check. Add the new param to the Arguments column, instead of describing it in the Description
| ) | ||
|
|
||
| actual = test_df.select( | ||
| is_not_null_and_is_in_list("a", ["str1"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
please don't remove existing tests. Add additional tests instead.
| [ | ||
| [None, "Value '1' in Column 'b' is null or not in the allowed list: [3]", None, None], | ||
| [ | ||
| "Value 'str2' in Column 'a' is null or not in the allowed list: [str1]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, don't remove existing tests.
| ) | ||
|
|
||
| actual = test_df.select( | ||
| is_in_list("a", ["str1"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, don't remove existing tests.
| [ | ||
| [None, "Value '1' in Column 'b' is not in the allowed list: [3]", None, None], | ||
| [ | ||
| "Value 'str2' in Column 'a' is not in the allowed list: [str1]", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, don't remove existing tests.
| allowed_cols_display = [item if isinstance(item, Column) else F.lit(item) for item in allowed] | ||
|
|
||
| col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column) | ||
| condition = col_expr.isNull() | ~col_expr.isin(*allowed_cols) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simplify this pls?
I think we can keep most of the code intact, just apply lower if needed.
for example:
col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr
allowed_cols_compare = [F.lower(c) for c in allowed_cols_display] if not case_sensitive else allowed_cols_display
condition = col_expr.isNull() | (~col_expr_compare.isin(*allowed_cols_compare))
|
|
||
| allowed_cols = [item if isinstance(item, Column) else F.lit(item) for item in allowed] | ||
| # Keep original values for display in error message | ||
| allowed_cols_display = [item if isinstance(item, Column) else F.lit(item) for item in allowed] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above, simplify pls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds case-insensitive comparison support to the is_in_list and is_not_null_and_is_in_list functions as requested in issue #638.
- Added optional
case_sensitiveparameter (defaults toTrue) to both functions - Implemented case-insensitive comparison by applying
F.lower()to both the column values and allowed values - Updated integration tests to verify the new functionality with both case-sensitive and case-insensitive scenarios
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Added case_sensitive parameter and logic for case-insensitive comparisons |
| tests/integration/test_row_checks.py | Updated test cases to use the new case_sensitive parameter |
| docs/dqx/docs/reference/quality_checks.mdx | Updated documentation to mention the case_sensitive parameter |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| | `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | | ||
| | `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | |
Copilot
AI
Oct 31, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parameter documentation in the third column doesn't mention the new case_sensitive parameter. The parameters column should be updated to include: case_sensitive: optional boolean flag for case-sensitive comparison (default: True).
| | `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | | |
| | `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values | | |
| | `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values; `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True) | | |
| | `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values; `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True) | |
|
|
||
| # Apply case-insensitive transformation if needed | ||
| col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr | ||
| allowed_cols_compare = [F.lower(c) if not case_sensitive else c for c in allowed_cols] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should move the ternary outside of the list so that we are not doing the comparison for every element:
allowed_cols_compare = [F.lower(c) for c in allowed_cols] if not case_sensitive else allowed_cols
| # Apply case-insensitive transformation if needed | ||
| col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to define and handle what case insensitivity means for complex objects (e.g. arrays, maps, structs). We probably want case insensitivity for all keys and values in MapType and StructType columns and case insensitivity for all items in ArrayType columns.
lower(col) will throw an error for complex data types. We might want to create a private helper function to handle casing based on the column type.
| is_not_null_and_is_in_list("a", ["str1"], case_sensitive=False), | ||
| is_not_null_and_is_in_list(F.col("c").getItem("val"), [F.lit("a")], case_sensitive=False), | ||
| is_not_null_and_is_in_list(F.try_element_at("d", F.lit(2)), ["b"], case_sensitive=False), | ||
| is_not_null_and_is_in_list("d", [["a", "b"], ["B", "c"]], case_sensitive=False), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we also add a case-sensitive test for arrays?
| def is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column: | ||
| """Checks whether the values in the input column are present in the list of allowed values | ||
| (null values are allowed). | ||
| (null values are allowed). Can optionally perform a case-insensitive comparison. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe note the limitations for MapType and StructType here in the docstring.
| def is_not_null_and_is_in_list(column: str | Column, allowed: list) -> Column: | ||
| def is_not_null_and_is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column: | ||
| """Checks whether the values in the input column are not null and present in the list of allowed values. | ||
| Can optionally perform a case-insensitive comparison. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe note the limitations for MapType and StructType here in the docstring.
issue #638
Changes
Added case-insensitive comparison support to is_in_list and is_not_null_and_is_in_list
Linked issues
Added feature as per issue #638
Resolves #638
Tests