v0.5.0
What's Changed
- Fix spark remote version detection in CI by @mwojtyczka in #342
- Fix spark remote installation by @mwojtyczka in #346
- Load and save checks from a Delta table by @ghanse in #339
- Handle nulls in uniqueness check for composite keys to conform with the SQL ANSI Standard by @mwojtyczka in #345
Corrected the is_unique check to handle nulls consistently. To conform with the SQL ANSI Standard, null values are now treated as unknown and therefore not as duplicates, e.g. (NULL, NULL) does not equal (NULL, NULL) and (1, NULL) does not equal (1, NULL). A new parameter called nulls_distinct has been added to the check; it is set to True by default. If set to False, null values are treated as duplicates, e.g. (1, NULL) equals (1, NULL) and (NULL, NULL) equals (NULL, NULL).
Added a columns parameter to the is_unique check so that a list of columns can be passed to handle composite keys.
- Allow user metadata for individual checks by @ghanse in #352
- Add save_results_in_table function by @karthik-ballullaya-db in #319
- Fix older than by @ghanse in #354
- Add PII-detection example by @ghanse in #358
- Add aggregation type of checks (count/min/max/sum/avg/mean over a group of records) by @mwojtyczka in #357
- Updated documentation to provide guidance on applying quality checks to multiple data sets.
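The null handling described for the is_unique check can be illustrated with a small pure-Python sketch. This is not the DQX implementation and does not call DQX or Spark; the function name `duplicate_flags` and the sample rows are hypothetical, and only the `nulls_distinct` parameter name and its default come from the notes above.

```python
from collections import Counter
from typing import List, Optional, Sequence

# A composite key is a tuple of column values; None stands in for SQL NULL.
Row = Sequence[Optional[object]]


def duplicate_flags(rows: Sequence[Row], nulls_distinct: bool = True) -> List[bool]:
    """Return one flag per row: True if its composite key counts as a duplicate.

    Hypothetical helper that only mirrors the semantics described in the
    release notes; it is not part of DQX.
    """

    def is_comparable(row: Row) -> bool:
        # With nulls_distinct=True, a key containing any NULL is "unknown"
        # and therefore never matches another key.
        return not (nulls_distinct and any(value is None for value in row))

    # Count only the keys that are allowed to match other keys.
    counts = Counter(tuple(row) for row in rows if is_comparable(row))
    return [is_comparable(row) and counts[tuple(row)] > 1 for row in rows]


rows = [(1, None), (1, None), (None, None), (None, None), (2, "a"), (2, "a"), (3, "b")]

# Default (ANSI-style): keys containing NULL are never flagged.
print(duplicate_flags(rows))
# nulls_distinct=False: NULLs compare equal, so the NULL keys are flagged too.
print(duplicate_flags(rows, nulls_distinct=False))
```

With the default, only the two (2, "a") rows are flagged; with `nulls_distinct=False`, the repeated (1, NULL) and (NULL, NULL) keys are flagged as well.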
BREAKING CHANGES AND HOW TO UPGRADE!
This release introduces changes to the syntax of quality rules to ensure we can scale DQX functionality in the future. Please update your quality rules accordingly.
- Renamed the `col_name` field in the checks to `column` to better reflect the fact that it can be a column name or a column expression.
From:
- check:
    function: is_not_null
    arguments:
      col_name: col1
To:
- check:
    function: is_not_null
    arguments:
      column: col1
- Similarly, custom checks must define `column` instead of `col_name` as the first argument.
- Renamed and changed the data type of `col_name` to `columns` in the reporting columns. DQX now stores a list of columns instead of a single column to handle checks that take multiple columns as input.
- Renamed `col_names` to `for_each_column` and moved it from the check function arguments to the check level.
From:
- check:
    function: is_not_null
    arguments:
      col_names:
      - col1
      - col2
To:
- check:
    function: is_not_null
    for_each_column:
    - col1
    - col2
- Renamed `DQColRule` to `DQRowRule` for clarity. This only affects checks defined using classes.
- Renamed `DQColSetRule` to `DQRowRuleForEachCol` for clarity. This only affects checks defined using classes.
- Renamed the `row_checks` module containing check functions to `check_funcs`. This only affects checks defined using classes.
The documentation has been updated accordingly: https://databrickslabs.github.io/dqx/docs/reference/quality_rules/
Full Changelog: v0.4.0...v0.5.0