v0.5.0

@github-actions github-actions released this 06 Jun 17:08
· 72 commits to main since this release
fcb37f9

What's Changed

  • Fix spark remote version detection in CI by @mwojtyczka in #342
  • Fix spark remote installation by @mwojtyczka in #346
  • Load and save checks from a Delta table by @ghanse in #339
  • Handle nulls in uniqueness check for composite keys to conform with the SQL ANSI Standard by @mwojtyczka in #345

Corrected the is_unique check to handle nulls consistently. To conform with the ANSI SQL standard, null values are now treated as unknown and therefore never count as duplicates, e.g. (NULL, NULL) does not equal (NULL, NULL), and (1, NULL) does not equal (1, NULL). A new parameter, nulls_distinct, has been added to the check and is set to True by default. If set to False, null values are treated as duplicates, e.g. (1, NULL) equals (1, NULL) and (NULL, NULL) equals (NULL, NULL).
Added a columns parameter to the is_unique check so that a list of columns can be passed to handle composite keys.
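
The null-handling rule described above can be illustrated with a small plain-Python sketch. It does not use DQX or Spark and is not how DQX implements the check; the function name find_duplicate_keys and the sample rows are hypothetical, and only the nulls_distinct flag and its default of True come from these notes.

```python
from collections import Counter
from typing import Hashable, Iterable, Optional, Set, Tuple

# A composite key is a tuple of column values; None stands in for SQL NULL.
Key = Tuple[Optional[Hashable], ...]


def find_duplicate_keys(keys: Iterable[Key], nulls_distinct: bool = True) -> Set[Key]:
    """Return the composite keys that occur more than once.

    Hypothetical helper for illustration only. With nulls_distinct=True,
    a key containing None is treated as unknown and never reported as a
    duplicate. With nulls_distinct=False, None compares equal to None.
    """
    counts: Counter = Counter()
    for key in keys:
        if nulls_distinct and any(part is None for part in key):
            continue  # unknown value: cannot be proven equal to anything
        counts[key] += 1
    return {key for key, n in counts.items() if n > 1}


rows = [(1, None), (1, None), (None, None), (None, None), (2, "a"), (2, "a")]

strict = find_duplicate_keys(rows)                         # default: nulls are distinct
relaxed = find_duplicate_keys(rows, nulls_distinct=False)  # nulls compare equal
```

With the default, only (2, "a") is reported as a duplicate; with nulls_distinct=False, (1, None) and (None, None) are reported as well.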

  • Allow user metadata for individual checks by @ghanse in #352
  • Add save_results_in_table function by @karthik-ballullaya-db in #319
  • Fix older than by @ghanse in #354
  • Add PII-detection example by @ghanse in #358
  • Add aggregation type of checks (count/min/max/sum/avg/mean over a group of records) by @mwojtyczka in #357
  • Updated documentation to provide guidance on applying quality checks to multiple data sets.

BREAKING CHANGES AND HOW TO UPGRADE!

This release introduces changes to the syntax of quality rules to ensure we can scale DQX functionality in the future. Please update your quality rules accordingly.

  • Renamed the col_name field in checks to column to better reflect that it can be a column name or a column expression.

From:

- check:
    function: is_not_null
    arguments:
      col_name: col1

To:

- check:
    function: is_not_null
    arguments:
      column: col1

  • Similarly, custom checks must define column instead of col_name as their first argument.
  • Renamed col_name to columns in the reporting columns and changed its data type. DQX now stores a list of columns instead of a single column to handle checks that take multiple columns as input.
  • Renamed col_names to for_each_column and moved it from the check function arguments to the check level.

From:

- check:
    function: is_not_null
    arguments:
      col_names:
      - col1
      - col2

To:

- check:
    function: is_not_null
    for_each_column:
    - col1
    - col2

  • Renamed DQColRule to DQRowRule for clarity. This only affects checks defined using classes.
  • Renamed DQColSetRule to DQRowRuleForEachCol for clarity. This only affects checks defined using classes.
  • Renamed row_checks module containing check functions to check_funcs. This only affects checks defined using classes.
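
For rules kept as YAML or JSON metadata, the two argument renames above can be applied mechanically. The following is a minimal sketch, assuming each rule has already been loaded into a Python dict with the same shape as the From/To snippets; migrate_check is a hypothetical helper, not part of DQX, and it does not cover the class renames or the reporting-column change.

```python
import copy
from typing import Any, Dict


def migrate_check(rule: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of one rule rewritten to the v0.5.0 syntax.

    Hypothetical helper for illustration; the input rule is left untouched.
    """
    migrated = copy.deepcopy(rule)
    check = migrated.get("check", {})
    arguments = check.get("arguments", {})

    # col_name -> column (stays inside the function arguments)
    if "col_name" in arguments:
        arguments["column"] = arguments.pop("col_name")

    # col_names -> for_each_column (moves up to the check level)
    if "col_names" in arguments:
        check["for_each_column"] = arguments.pop("col_names")

    # Drop an arguments mapping that became empty after the move
    if "arguments" in check and not check["arguments"]:
        del check["arguments"]

    return migrated


old_rules = [
    {"check": {"function": "is_not_null", "arguments": {"col_name": "col1"}}},
    {"check": {"function": "is_not_null", "arguments": {"col_names": ["col1", "col2"]}}},
]
new_rules = [migrate_check(rule) for rule in old_rules]
```

Any other keys on the rule or in its arguments are copied through unchanged, so the result should still be reviewed against the updated documentation before use.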

The documentation has been updated accordingly: https://databrickslabs.github.io/dqx/docs/reference/quality_rules/

Full Changelog: v0.4.0...v0.5.0