
Conversation

@AdityaMandiwal
Contributor

issue #638

Changes

Added case-insensitive comparison support to `is_in_list` and `is_not_null_and_is_in_list`.
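
A minimal usage sketch of the new flag (illustrative column and data, not taken from the PR's tests; assumes the `case_sensitive: bool = True` signature added in this PR):

```python
from pyspark.sql import SparkSession

from databricks.labs.dqx.check_funcs import is_in_list, is_not_null_and_is_in_list

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("US",), ("us",), (None,)], ["country"])

result = df.select(
    # Default behaviour: exact, case-sensitive membership check ("us" is flagged).
    is_in_list("country", ["US", "CA"]),
    # New behaviour: case-insensitive check, so "us" matches "US".
    is_not_null_and_is_in_list("country", ["US", "CA"], case_sensitive=False),
)
result.show(truncate=False)
```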

Linked issues

Added the feature requested in issue #638.

Resolves #638

Tests

  • [x] manually tested
  • [ ] added unit tests
  • [x] added integration tests
  • [ ] added end-to-end tests
  • [ ] added performance tests

@github-actions

github-actions bot commented Oct 31, 2025

❌ 403/404 passed, 1 flaky, 1 failed, 39 skipped, 4h33m43s total

❌ test_e2e_workflow_for_patterns: databricks.sdk.errors.platform.Unknown: run_quality_checker: Please refer to the logs for this run on the triggered run details page. (7m50.108s)
databricks.sdk.errors.platform.Unknown: run_quality_checker: Please refer to the logs for this run on the triggered run details page.
20:38 INFO [databricks.sdk] Using Databricks Metadata Service authentication
[gw7] linux -- Python 3.12.12 /home/runner/work/dqx/dqx/.venv/bin/python
20:38 INFO [databricks.sdk] Using Databricks Metadata Service authentication
20:38 INFO [databricks.labs.dqx.installer.install] Please answer a couple of questions to provide TEST_SCHEMA DQX run configuration. The configuration can also be updated manually after the installation.
20:38 INFO [databricks.labs.dqx.installer.install] DQX will be installed in the TEST_SCHEMA location: '/Users/<your_user>/.dqx'
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Log level
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Should the input data be read using streaming?
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide location for the input data as a path or table in the fully qualified format `catalog.schema.table` or `schema.table`
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide format for the input data (e.g. delta, parquet, csv, json)
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide schema for the input data (e.g. col1 int, col2 string)
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide additional options for reading the input data (e.g. {"versionAsOf": "0"})
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide output table in the fully qualified format `catalog.schema.table` or `schema.table`
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide write mode for output table (e.g. 'append' or 'overwrite')
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide format for the output data (e.g. delta, parquet)
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide additional options for writing the output data (e.g. {"mergeSchema": "true"})
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide quarantined table in the fully qualified format `catalog.schema.table` or `schema.table` (use output table if skipped)
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Do you want to store summary metrics from data quality checking in a table?
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide location of the quality checks definitions, either:
- a filename for storing data quality rules (e.g. checks.yml),
- or a table for storing checks in the format `catalog.schema.table` or `schema.table`,
- or a full volume path in the format /Volumes/catalog/schema/volume/<folder_path>/<file_name_with_extension>,

20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide filename for storing profile summary statistics
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Do you want to use standard job clusters for the workflows execution (not Serverless)?
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide reference tables to use for checks as a dictionary that maps reference table name to reference data location. The specification can contain fields from InputConfig such as: location, format, schema, options and is_streaming fields (e.g. {"reference_vendor":{"location": "catalog.schema.table", "format": "delta"}})
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Provide custom check functions as a dictionary that maps function name to a python module located in the workspace file (relative or absolute workspace path) or volume (e.g. {"my_func": "/Workspace/Shared/my_module.py"}), 
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Select PRO or SERVERLESS SQL warehouse to run data quality dashboards on
[0] [Create new PRO or SERVERLESS SQL warehouse]
[1] DEFAULT Test Warehouse (TEST_DEFAULT_WAREHOUSE_ID, SERVERLESS, RUNNING)
Enter a number between 0 and 1
20:38 INFO [databricks.labs.blueprint.tui] Asking prompt: Open config file in the browser and continue installing? https://DATABRICKS_HOST/#workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.C6Sz/config.yml
20:38 INFO [databricks.labs.dqx.installer.install] Installing DQX v0.9.4+1920251106203827
20:38 INFO [databricks.labs.dqx.installer.dashboard_installer] Creating dashboards...
20:38 INFO [databricks.labs.dqx.installer.dashboard_installer] Reading dashboard assets from /home/runner/work/dqx/dqx/src/databricks/labs/dqx/queries/quality/dashboard...
20:38 INFO [databricks.labs.dqx.installer.dashboard_installer] Using 'main.dqx_test.output_table' output table as the source table for the dashboard...
20:38 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
20:38 WARNING [databricks.labs.lsql.dashboards] Parsing unsupported field in dashboard.yml: tiles.00_2_dq_error_types.hidden
20:38 WARNING [databricks.labs.lsql.dashboards] Parsing : No expression was parsed from ''
20:38 INFO [databricks.labs.dqx.installer.dashboard_installer] Installing 'DQX_Quality_Dashboard' dashboard in '/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.C6Sz/dashboards'
20:38 INFO [databricks.labs.blueprint.parallel] Installing dashboards 1/1, rps: 1.127/sec
20:38 INFO [databricks.labs.blueprint.parallel] Finished 'Installing dashboards' tasks: 0% results available (0/1). Took 0:00:00.887813
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=profiler
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=quality-checker
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Creating new job configuration for step=e2e
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=46564399039864
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=46564399039864
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Updating configuration for step=e2e job_id=46564399039864
20:38 INFO [databricks.labs.blueprint.parallel] installing components 2/2, rps: 0.167/sec
20:38 INFO [databricks.labs.blueprint.parallel] Finished 'installing components' tasks: 0% results available (0/2). Took 0:00:11.961805
20:38 INFO [databricks.labs.dqx.installer.install] Installation completed successfully!
20:38 INFO [databricks.labs.pytester.fixtures.baseline] Created dqx.dummy_s9h62kvix schema: https://DATABRICKS_HOST/#explore/data/dqx/dummy_s9h62kvix
20:38 INFO [databricks.labs.pytester.fixtures.baseline] Created dqx.dummy_s9h62kvix.dummy_t1tnwzwnv schema: https://DATABRICKS_HOST/#explore/data/dqx/dummy_s9h62kvix/dummy_t1tnwzwnv
20:38 INFO [databricks.labs.dqx.installer.workflow_installer] Started e2e workflow: https://DATABRICKS_HOST#job/46564399039864/runs/454059117509944
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- REMOTE LOGS --------------
20:46 INFO [databricks.labs.dqx:prepare] DQX v0.9.4+1920251106203827 After workflow finishes, see debug logs at /Workspace/Users/3fe685a1-96cc-4fec-8cdb-6944f5c9787e/.C6Sz/logs/e2e/run-454059117509944-0/prepare.log
20:46 INFO [databricks.labs.dqx.quality_checker.e2e_workflow:prepare] End-to-end: prepare start for patterns: dqx.dummy_s9h62kvix.*
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] ---------- END REMOTE LOGS ----------
20:46 INFO [databricks.labs.blueprint.tui] Asking prompt: Do you want to uninstall DQX from the workspace? this would remove dqx project install_folder, dashboards, and jobs
20:46 INFO [databricks.labs.dqx.installer.install] Deleting DQX v0.9.4+1920251106203827 from https://DATABRICKS_HOST
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=150901479021274, as it is no longer needed
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=110220865510400, as it is no longer needed
20:46 INFO [databricks.labs.dqx.installer.workflow_installer] Removing job_id=46564399039864, as it is no longer needed
20:46 INFO [databricks.labs.dqx.installer.install] Uninstalling DQX complete
[gw7] linux -- Python 3.12.12 /home/runner/work/dqx/dqx/.venv/bin/python

Flaky tests:

  • 🤪 test_e2e_workflow_with_custom_install_folder (5m58.11s)

Running from acceptance #3084

@codecov

codecov bot commented Oct 31, 2025

Codecov Report

❌ Patch coverage is 78.94737% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 54.14%. Comparing base (a06e918) to head (1dc87a3).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/databricks/labs/dqx/check_funcs.py | 85.71% | 2 Missing ⚠️ |
| src/databricks/labs/dqx/utils.py | 60.00% | 2 Missing ⚠️ |

❗ There is a different number of reports uploaded between BASE (a06e918) and HEAD (1dc87a3).

HEAD has 7 uploads less than BASE.

| Flag | BASE (a06e918) | HEAD (1dc87a3) |
|---|---|---|
| | 8 | 1 |
Additional details and impacted files
@@             Coverage Diff             @@
##             main     #673       +/-   ##
===========================================
- Coverage   89.77%   54.14%   -35.64%     
===========================================
  Files          56       60        +4     
  Lines        4966     5181      +215     
===========================================
- Hits         4458     2805     -1653     
- Misses        508     2376     +1868     

☔ View full report in Codecov by Sentry.

| `is_not_null_and_not_empty` | Checks whether the values in the input column are not null and not empty. | `column`: column to check (can be a string column name or a column expression); `trim_strings`: optional boolean flag to trim spaces from strings |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
Contributor

Please follow the format of the other checks. Add the new param to the Arguments column instead of describing it in the Description.

| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
Contributor

Please follow the format of the other checks. Add the new param to the Arguments column instead of describing it in the Description.

)

actual = test_df.select(
is_not_null_and_is_in_list("a", ["str1"]),
Contributor

please don't remove existing tests. Add additional tests instead.

[
[None, "Value '1' in Column 'b' is null or not in the allowed list: [3]", None, None],
[
"Value 'str2' in Column 'a' is null or not in the allowed list: [str1]",
Contributor

same as above, don't remove existing tests.

)

actual = test_df.select(
is_in_list("a", ["str1"]),
Contributor

same as above, don't remove existing tests.

[
[None, "Value '1' in Column 'b' is not in the allowed list: [3]", None, None],
[
"Value 'str2' in Column 'a' is not in the allowed list: [str1]",
Contributor

same as above, don't remove existing tests.

allowed_cols_display = [item if isinstance(item, Column) else F.lit(item) for item in allowed]

col_str_norm, col_expr_str, col_expr = _get_normalized_column_and_expr(column)
condition = col_expr.isNull() | ~col_expr.isin(*allowed_cols)
Contributor

Can we simplify this, please? I think we can keep most of the code intact and just apply `lower` if needed.

for example:

col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr
allowed_cols_compare = [F.lower(c) for c in allowed_cols_display] if not case_sensitive else allowed_cols_display

condition = col_expr.isNull() | (~col_expr_compare.isin(*allowed_cols_compare))


allowed_cols = [item if isinstance(item, Column) else F.lit(item) for item in allowed]
# Keep original values for display in error message
allowed_cols_display = [item if isinstance(item, Column) else F.lit(item) for item in allowed]
@mwojtyczka (Contributor) Oct 31, 2025

Same as above, please simplify.
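
Taken together, the simplified fragment might look roughly like this (a sketch based on the snippets quoted above, not the merged code):

```python
# Fragment inside is_not_null_and_is_in_list (names as in the snippets above);
# only the comparison expressions are lowered, the originals stay for the error message.
allowed_cols = [item if isinstance(item, Column) else F.lit(item) for item in allowed]

col_expr_compare = col_expr if case_sensitive else F.lower(col_expr)
allowed_cols_compare = allowed_cols if case_sensitive else [F.lower(c) for c in allowed_cols]

condition = col_expr.isNull() | ~col_expr_compare.isin(*allowed_cols_compare)
```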

@mwojtyczka requested a review from Copilot on October 31, 2025, 12:53
Copilot AI left a comment

Pull Request Overview

This PR adds case-insensitive comparison support to the is_in_list and is_not_null_and_is_in_list functions as requested in issue #638.

  • Added optional case_sensitive parameter (defaults to True) to both functions
  • Implemented case-insensitive comparison by applying F.lower() to both the column values and allowed values
  • Updated integration tests to verify the new functionality with both case-sensitive and case-insensitive scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

| File | Description |
|---|---|
| src/databricks/labs/dqx/check_funcs.py | Added case_sensitive parameter and logic for case-insensitive comparisons |
| tests/integration/test_row_checks.py | Updated test cases to use the new case_sensitive parameter |
| docs/dqx/docs/reference/quality_checks.mdx | Updated documentation to mention the case_sensitive parameter |


Comment on lines 31 to 32
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
Copilot AI Oct 31, 2025

The parameter documentation in the third column doesn't mention the new case_sensitive parameter. The parameters column should be updated to include: case_sensitive: optional boolean flag for case-sensitive comparison (default: True).

Suggested change
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values |
| `is_in_list` | Checks whether the values in the input column are present in the list of allowed values (null values are allowed). We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values; `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True) |
| `is_not_null_and_is_in_list` | Checks whether the values in the input column are not null and present in the list of allowed values. We can pass an additional case_sensitive parameter as False for a case insensitive check. This check is not suited for large lists of allowed values. In such cases, it’s recommended to use the `foreign_key` dataset-level check instead. | `column`: column to check (can be a string column name or a column expression); `allowed`: list of allowed values; `case_sensitive`: optional boolean flag for case-sensitive comparison (default: True) |


# Apply case-insensitive transformation if needed
col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr
allowed_cols_compare = [F.lower(c) if not case_sensitive else c for c in allowed_cols]
Contributor

We should move the ternary outside of the list so that we are not doing the comparison for every element:

allowed_cols_compare = [F.lower(c) for c in allowed_cols] if not case_sensitive else allowed_cols

Comment on lines 158 to 159
# Apply case-insensitive transformation if needed
col_expr_compare = F.lower(col_expr) if not case_sensitive else col_expr
Contributor

We need to define and handle what case insensitivity means for complex objects (e.g. arrays, maps, structs). We probably want case insensitivity for all keys and values in MapType and StructType columns and case insensitivity for all items in ArrayType columns.

lower(col) will throw an error for complex data types. We might want to create a private helper function to handle casing based on the column type.
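
For illustration, such a private helper might look like the sketch below (hypothetical, not part of this PR; it also assumes the column's resolved DataType is available, which the row-level check functions do not currently receive, so wiring it in would need a schema lookup):

```python
from pyspark.sql import Column
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DataType, MapType, StringType, StructType


def _lower_by_type(expr: Column, dtype: DataType) -> Column:
    """Recursively lower-case string values based on the Spark data type."""
    if isinstance(dtype, StringType):
        return F.lower(expr)
    if isinstance(dtype, ArrayType):
        # Lower each array element according to the element type.
        return F.transform(expr, lambda item: _lower_by_type(item, dtype.elementType))
    if isinstance(dtype, MapType):
        # Lower all keys, then all values.
        lowered_keys = F.transform_keys(expr, lambda k, _: _lower_by_type(k, dtype.keyType))
        return F.transform_values(lowered_keys, lambda _, v: _lower_by_type(v, dtype.valueType))
    if isinstance(dtype, StructType):
        # Rebuild the struct with each field lowered and its original name kept.
        return F.struct(
            *[_lower_by_type(expr[f.name], f.dataType).alias(f.name) for f in dtype.fields]
        )
    return expr  # numeric, boolean, etc. are left unchanged
```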

is_not_null_and_is_in_list("a", ["str1"], case_sensitive=False),
is_not_null_and_is_in_list(F.col("c").getItem("val"), [F.lit("a")], case_sensitive=False),
is_not_null_and_is_in_list(F.try_element_at("d", F.lit(2)), ["b"], case_sensitive=False),
is_not_null_and_is_in_list("d", [["a", "b"], ["B", "c"]], case_sensitive=False),
Contributor

Can we also add a case-sensitive test for arrays?
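
For example, a case-sensitive counterpart to the quoted array call could sit alongside it in the same `select` (illustrative only; the expected messages depend on the implementation):

```python
# Default case_sensitive=True: an array like ["b", "c"] should NOT match ["B", "c"].
is_not_null_and_is_in_list("d", [["a", "b"], ["B", "c"]]),
```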

def is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column:
"""Checks whether the values in the input column are present in the list of allowed values
(null values are allowed).
(null values are allowed). Can optionally perform a case-insensitive comparison.
Contributor

Maybe note the limitations for MapType and StructType here in the docstring.
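
One way such a note might read (a suggestion only, not the merged docstring; imports omitted):

```python
def is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column:
    """Checks whether the values in the input column are present in the list of allowed values
    (null values are allowed). Can optionally perform a case-insensitive comparison.

    Note: case-insensitive comparison (``case_sensitive=False``) currently targets string
    values; for complex types such as ArrayType, MapType and StructType, ``F.lower`` cannot
    be applied directly, so their handling needs to be defined separately.
    """
```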

def is_not_null_and_is_in_list(column: str | Column, allowed: list) -> Column:
def is_not_null_and_is_in_list(column: str | Column, allowed: list, case_sensitive: bool = True) -> Column:
"""Checks whether the values in the input column are not null and present in the list of allowed values.
Can optionally perform a case-insensitive comparison.
Contributor

Maybe note the limitations for MapType and StructType here in the docstring.



Development

Successfully merging this pull request may close these issues.

[FEATURE]: Update is_in_list and is_not_null_and_is_in_list functions to optionally handle case sensitivity

4 participants