
Conversation

@AdityaMandiwal
Contributor

issue #638

Changes

Added case-insensitive comparison support to `is_in_list` and `is_not_null_and_is_in_list`.
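
A minimal usage sketch of the new flag (illustrative column and data, not taken from the PR's tests; assumes the `case_sensitive: bool = True` signature added in this PR):

```python
from pyspark.sql import SparkSession

from databricks.labs.dqx.check_funcs import is_in_list, is_not_null_and_is_in_list

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("US",), ("us",), (None,)], ["country"])

result = df.select(
    # Default behaviour: exact, case-sensitive membership check ("us" is flagged).
    is_in_list("country", ["US", "CA"]),
    # New behaviour: case-insensitive check, so "us" matches "US".
    is_not_null_and_is_in_list("country", ["US", "CA"], case_sensitive=False),
)
result.show(truncate=False)
```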

Linked issues

Added the feature requested in issue #638.

Resolves #638

Tests

  • [x] manually tested
  • [ ] added unit tests
  • [x] added integration tests
  • [ ] added end-to-end tests
  • [ ] added performance tests

@github-actions

github-actions bot commented Oct 31, 2025

❌ 403/404 passed, 1 flaky, 1 failed, 39 skipped, 4h33m43s total

❌ test_e2e_workflow_for_patterns: databricks.sdk.errors.platform.Unknown: run_quality_checker: Please refer to the logs for this run on the triggered run details page. (7m50.108s)
databricks.sdk.errors.platform.Unknown: run_quality_checker: Please refer to the logs for this run on the triggered run details page.
20:38 INFO [databricks.sdk] Using Databricks Metadata Service authentication
[gw7] linux -- Python 3.12.12 /home/runner/work/dqx/dqx/.venv/bin/python
20:38 INFO [databricks.sdk] Using Databricks Metadata Service authentication
20:38 INFO [databricks.labs.dqx.installer.install] Please answer a couple of questions to provide TEST_SCHEMA DQX run configuration. The configuration can also be updated manually after the installation.
20:38 INFO [databricks.labs.dqx.installer.install] DQX will be installed in the TEST_SCHEMA location: '/Users/<your_user>/.dqx'
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Log level
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Should the input data be read using streaming?
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide location for the input data as a path or table in the fully qualified format `catalog.schema.table` or `schema.table`
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide format for the input data (e.g. delta, parquet, csv, json)
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide schema for the input data (e.g. col1 int, col2 string)
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide additional options for reading the input data (e.g. {"versionAsOf": "0"})
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide output table in the fully qualified format `catalog.schema.table` or `schema.table`
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide write mode for output table (e.g. 'append' or 'overwrite')
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide format for the output data (e.g. delta, parquet)
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide additional options for writing the output data (e.g. {"mergeSchema": "true"})
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide quarantined table in the fully qualified format `catalog.schema.table` or `schema.table` (use output table if skipped)
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Do you want to store summary metrics from data quality checking in a table?
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide location of the quality checks definitions, either:
- a filename for storing data quality rules (e.g. checks.yml),
- or a table for storing checks in the format `catalog.schema.table` or `schema.table`,
- or a full volume path in the format /Volumes/catalog/schema/volume/<folder_path>/<file_name_with_extension>,

20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide filename for storing profile summary statistics
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Do you want to use standard job clusters for the workflows execution (not Serverless)?
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide reference tables to use for checks as a dictionary that maps reference table name to reference data location. The specification can contain fields from InputConfig such as: location, format, schema, options and is_streaming fields (e.g. {"reference_vendor":{"location": "catalog.schema.table", "format": "delta"}})
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide custom check functions as a dictionary that maps function name to a python module located in the workspace file (relative or absolute workspace path) or volume (e.g. {"my_func": "/Workspace/Shared/my_module.py"}), 
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Select PRO or SERVERLESS SQL warehouse to run data quality dashboards on
[0] [Create new PRO or SERVERLESS SQL warehouse]
[1] DEFAULT Test Warehouse (TEST_DEFAULT_WAREHOUSE_ID, SERVERLESS, RUNNING)
Enter a number between 0 and 1
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Open config file in the browser and continue installing? https://DATABRICKS_HOST/#workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.C6Sz/config.yml
20:38 INFO [databricks.labs.dqx.installer.install] Installing DQX v0.9.4+1920251106203827
20:38 INFO [databricks.labs.dqx.installer.dashboard_installer] Creating dashboards...
20:38 INFO [databricks.labs.dqx.installer.dashboard_installer] Reading dashboard assets from /home/runner/work/dqx/dqx/src/databricks/labs/dqx/queries/quality/dashboard...
20:38 INFO [databricks.labs.dqx.installer.dashboard_installer] Using 'main.dqx_test.output_table' output table as the source table for the dashboard...
20:38 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
20:38 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
20:38 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
20:38 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.C6Sz/dashboards'
20:38 INFO [databricks.labs.blueprint.parallel] Installing dashboards 1/1, rps: 1.127/sec
20:38 INFO [databricks.labs.blueprint.parallel] Finished 'Installing dashboards' tasks: 0% results available (0/1). Took 0:00:00.887813
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=46564399039864
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=46564399039864
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=46564399039864
20:38 INFO [databricks.labs.blueprint.parallel] installing components 2/2, rps: 0.167/sec
20:38 INFO [databricks.labs.blueprint.parallel] Finished 'installing components' tasks: 0% results available (0/2). Took 0:00:11.961805
20:38 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
20:38 INFO [databricks.labs.pytester.fixtures.baseline] Created dqx.dummy_s9h62kvix schema: https://DATABRICKS_HOST/#explore/data/dqx/dummy_s9h62kvix
20:38 INFO [databricks.labs.pytester.fixtures.baseline] Created dqx.dummy_s9h62kvix.dummy_t1tnwzwnv schema: https://DATABRICKS_HOST/#explore/data/dqx/dummy_s9h62kvix/dummy_t1tnwzwnv
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Started e2e workflow: https://DATABRICKS_HOST#job/46564399039864/runs/454059117509944
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
20:46 INFO [databricks.labs.dqx:prepare] DQX v0.9.4+1920251106203827 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.C6Sz/logs/e2e/run-454059117509944-0/prepare.log
20:46 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:prepare] End-to-end: prepare start for patterns: dqx.dummy_s9h62kvix.*
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
20:46 INFO [databricks.labs.blueprint.tui] Asking prompt: Do you want to uninstall DQX from the workspace? this would remove dqx project install_folder, dashboards, and jobs
20:46 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+1920251106203827 from https://DATABRICKS_HOST
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=150901479021274, as it is no longer needed
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=110220865510400, as it is no longer needed
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=46564399039864, as it is no longer needed
20:46 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw7] linux -- Python 3.12.12 /home/runner/work/dqx/dqx/.venv/bin/python

Flaky tests:

  • 🤪 test_e2e_workflow_with_custom_install_folder (5m58.11s)

Running from acceptance #3084

@codecov

codecov bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.14%. Comparing base (a06e918) to head (1dc87a3).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/databricks/labs/dqx/check_funcs.py | 85.71% | 2 Missing ⚠️ |
| src/databricks/labs/dqx/utils.py | 60.00% | 2 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (a06e918) and HEAD (1dc87a3).

HEAD has 7 uploads less than BASE.

| Flag | BASE (a06e918) | HEAD (1dc87a3) |
|---|---|---|
| | 8 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #673       +/-   ##
===========================================
- Coverage   89.77%   54.14%   -35.64%     
===========================================
  Files          56       60        +4     
  Lines        4966     5181      +215     
===========================================
- Hits         4458     2805     -1653     
- Misses        508     2376     +1868     

☔ View full report in Codecov by Sentry.

| `is_not_null_and_not_empty` | Checks whether the values in the input column are not null and not empty. | `column`: column to check (can be a string column name or a column expression); `trim_strings`: optional boolean flag to trim spaces from strings |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
Contributor

Please follow the format of the other checks. Add the new param to the Arguments column instead of describing it in the Description.

| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
Contributor

Please follow the format of the other checks. Add the new param to the Arguments column instead of describing it in the Description.

)

actual = test_df.select(
is_not_null_and_is_in_list("a", ["str1"]),
Contributor

please don't remove existing tests. Add additional tests instead.

[
[None, "Value '1' in Column 'b' is null or not in the allowed list: [3]", None, None],
[
"Value 'str2' in Column 'a' is null or not in the allowed list: [str1]",
Contributor

same as above, don't remove existing tests.

)

actual = test_df.select(
is_in_list("a", ["str1"]),
Contributor

same as above, don't remove existing tests.

[
[None, "Value '1' in Column 'b' is not in the allowed list: [3]", None, None],
[
"Value 'str2' in Column 'a' is not in the allowed list: [str1]",
Contributor

same as above, don't remove existing tests.

allowed_cols_display = [item if isinstance(item, Column) else F.lit(item) for item in allowed]

col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
condition = col_expr.isNull() | ~col_expr.isin(*allowed_cols)
Contributor

Can we simplify this, please? I think we can keep most of the code intact and just apply `lower` if needed.

for example:

col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr
allowed_cols_compare = [F.lower(c) for c in allowed_cols_display] if not case_sensitive else allowed_cols_display

condition = col_expr.isNull() | (~col_expr_compare.isin(*allowed_cols_compare))


allowed_cols = [item if isinstance(item, Column) else F.lit(item) for item in allowed]
# Keep original values for display in error message
allowed_cols_display = [item if isinstance(item, Column) else F.lit(item) for item in allowed]
@mwojtyczka (Contributor) Oct 31, 2025

Same as above, please simplify.
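
Taken together, the simplified fragment might look roughly like this (a sketch based on the snippets quoted above, not the merged code):

```python
# Fragment inside is_not_null_and_is_in_list (names as in the snippets above);
# only the comparison expressions are lowered, the originals stay for the error message.
allowed_cols = [item if isinstance(item, Column) else F.lit(item) for item in allowed]

col_expr_compare = col_expr if case_sensitive else F.lower(col_expr)
allowed_cols_compare = allowed_cols if case_sensitive else [F.lower(c) for c in allowed_cols]

condition = col_expr.isNull() | ~col_expr_compare.isin(*allowed_cols_compare)
```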

@mwojtyczka requested a review from Copilot on October 31, 2025, 12:53
Copilot AI left a comment

Pull Request Overview

This PR adds case-insensitive comparison support to the is_in_list and is_not_null_and_is_in_list functions as requested in issue #638.

  • Added optional case_sensitive parameter (defaults to True) to both functions
  • Implemented case-insensitive comparison by applying F.lower() to both the column values and allowed values
  • Updated integration tests to verify the new functionality with both case-sensitive and case-insensitive scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Added case_sensitive parameter and logic for case-insensitive comparisons |
| tests/integration/test_row_checks.py | Updated test cases to use the new case_sensitive parameter |
| docs/dqx/docs/reference/quality_checks.mdx | Updated documentation to mention the case_sensitive parameter |


Comment on lines 31 to 32
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
Copilot AI Oct 31, 2025

The parameter documentation in the third column doesn't mention the new case_sensitive parameter. The parameters column should be updated to include: case_sensitive: optional boolean flag for case-sensitive comparison (default: True).

Suggested change
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values; `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True) |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values; `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True) |


# Apply case-insensitive transformation if needed
col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr
allowed_cols_compare = [F.lower(c) if not case_sensitive else c for c in allowed_cols]
Contributor

We should move the ternary outside of the list so that we are not doing the comparison for every element:

allowed_cols_compare = [F.lower(c) for c in allowed_cols] if not case_sensitive else allowed_cols

Comment on lines 158 to 159
# Apply case-insensitive transformation if needed
col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr
Contributor

We need to define and handle what case insensitivity means for complex objects (e.g. arrays, maps, structs). We probably want case insensitivity for all keys and values in MapType and StructType columns and case insensitivity for all items in ArrayType columns.

lower(col) will throw an error for complex data types. We might want to create a private helper function to handle casing based on the column type.
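
For illustration, such a private helper might look like the sketch below (hypothetical, not part of this PR; it also assumes the column's resolved DataType is available, which the row-level check functions do not currently receive, so wiring it in would need a schema lookup):

```python
from pyspark.sql import Column
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DataType, MapType, StringType, StructType


def _lower_by_type(expr: Column, dtype: DataType) -> Column:
    """Recursively lower-case string values based on the Spark data type."""
    if isinstance(dtype, StringType):
        return F.lower(expr)
    if isinstance(dtype, ArrayType):
        # Lower each array element according to the element type.
        return F.transform(expr, lambda item: _lower_by_type(item, dtype.elementType))
    if isinstance(dtype, MapType):
        # Lower all keys, then all values.
        lowered_keys = F.transform_keys(expr, lambda k, _: _lower_by_type(k, dtype.keyType))
        return F.transform_values(lowered_keys, lambda _, v: _lower_by_type(v, dtype.valueType))
    if isinstance(dtype, StructType):
        # Rebuild the struct with each field lowered and its original name kept.
        return F.struct(
            *[_lower_by_type(expr[f.name], f.dataType).alias(f.name) for f in dtype.fields]
        )
    return expr  # numeric, boolean, etc. are left unchanged
```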

is_not_null_and_is_in_list("a", ["str1"], case_sensitive=False),
is_not_null_and_is_in_list(F.col("c").getItem("val"), [F.lit("a")], case_sensitive=False),
is_not_null_and_is_in_list(F.try_element_at("d", F.lit(2)), ["b"], case_sensitive=False),
is_not_null_and_is_in_list("d", [["a", "b"], ["B", "c"]], case_sensitive=False),
Contributor

Can we also add a case-sensitive test for arrays?
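
For example, a case-sensitive counterpart to the quoted array call could sit alongside it in the same `select` (illustrative only; the expected messages depend on the implementation):

```python
# Default case_sensitive=True: an array like ["b", "c"] should NOT match ["B", "c"].
is_not_null_and_is_in_list("d", [["a", "b"], ["B", "c"]]),
```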

def is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column:
"""Checks whether the values in the input column are present in the list of allowed values
(null values are allowed).
(null values are allowed). Can optionally perform a case-insensitive comparison.
Contributor

Maybe note the limitations for MapType and StructType here in the docstring.
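
One way such a note might read (a suggestion only, not the merged docstring; imports omitted):

```python
def is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column:
    """Checks whether the values in the input column are present in the list of allowed values
    (null values are allowed). Can optionally perform a case-insensitive comparison.

    Note: case-insensitive comparison (``case_sensitive=False``) currently targets string
    values; for complex types such as ArrayType, MapType and StructType, ``F.lower`` cannot
    be applied directly, so their handling needs to be defined separately.
    """
```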

def is_not_null_and_is_in_list(column: str | Column, allowed: list) -> Column:
def is_not_null_and_is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column:
"""Checks whether the values in the input column are not null and present in the list of allowed values.
Can optionally perform a case-insensitive comparison.
Contributor

Maybe note the limitations for MapType and StructType here in the docstring.



Development

Successfully merging this pull request may close these issues.

[FEATURE]: Update is_in_list and is_not_null_and_is_in_list functions to optionally handle case sensitivity

4 participants