v0.9.1
- Added quality checker and end-to-end workflows (#519). This release introduces a no-code solution for applying checks. The following workflows were added: quality-checker (apply checks and save results to tables) and end-to-end (e2e) workflows (profile input data, generate quality checks, apply the checks, and save results to tables). The workflows enable quality checking for data at rest without the need for code-level integration. They support reference data for checks provided as tables (e.g., as required by foreign-key or dataset-comparison checks) as well as custom Python check functions (a mapping of each custom check function to the module path in the workspace or Unity Catalog volume containing the function definition). The workflows handle one run config per job run; a future release will introduce functionality to execute them across multiple tables. In addition, CLI commands have been added to execute the workflows. DQX workflows are now configured to execute on serverless clusters, with an option to use standard clusters as well. `InstallationChecksStorageHandler` now supports absolute workspace path locations.
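  As an illustration, a run config for such a workflow might look roughly like the sketch below. This is a hypothetical example: the field names follow the general shape of DQX installation configs but are assumptions, not verbatim from this release.

  ```yaml
  # Hypothetical sketch of a single run config consumed by the e2e workflow;
  # exact field names may differ from the actual DQX configuration schema.
  run_configs:
    - name: default
      input_config:
        location: main.raw.orders              # table to profile and check
        format: table
      output_config:
        location: main.quality.orders_checked  # valid rows with result columns
      quarantine_config:
        location: main.quality.orders_bad      # rows failing the checks
      checks_location: main.quality.checks     # where generated checks are stored
  ```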
- Added built-in row-level check for PII detection (#486). Introduced a new built-in check for Personally Identifiable Information (PII) detection, which utilizes the Presidio framework and can be configured using various parameters, such as the NLP entity-recognition configuration. This check can be defined using the `does_not_contain_pii` check function and can be customized to suit specific use cases. The check requires the `pii` extras to be installed: `pip install databricks-labs-dqx[pii]`. Furthermore, a new enum class `NLPEngineConfig` has been introduced to define various NLP engine configurations for PII detection. Overall, these updates provide more robust and customizable quality-checking capabilities for detecting PII data.
- Added equality row-level checks (#535). Two new row-level checks, `is_equal_to` and `is_not_equal_to`, have been introduced to enable equality checks on column values, allowing users to verify whether the values in a specified column are equal or not equal to a given value, which can be a numeric literal, column expression, string literal, date literal, or timestamp literal.
- Added demo for Spark Structured Streaming (#518). Added a demo to showcase usage of DQX with Spark Structured Streaming for in-transit data quality checking. The demo is available as a Databricks notebook and can be run on any Databricks workspace.
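  The new checks above can be declared like any other DQX check. A hedged YAML sketch follows: the column names are illustrative, and the argument names follow the common DQX check-definition shape but may differ in detail from the released API.

  ```yaml
  # Illustrative check definitions (column names are made up for this example).
  - criticality: error
    check:
      function: does_not_contain_pii
      arguments:
        column: customer_notes        # free-text column scanned for PII
  - criticality: warn
    check:
      function: is_equal_to
      arguments:
        column: currency
        value: "USD"                  # string literal to compare against
  - criticality: error
    check:
      function: is_not_equal_to
      arguments:
        column: status
        value: "deleted"
  ```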
- Added clarification to profiler summary statistics (#523). Added new section on understanding summary statistics, which explains how these statistics are computed on a sampled subset of the data and provides a reference for the various summary statistics fields.
- Fixed rounding datetimes in the checks generator (#517). The generator has been enhanced to correctly handle midnight values when rounding "up", ensuring that datetime values already at midnight remain unchanged, whereas previously they were rounded to the next day.
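  The fixed rounding rule can be sketched as follows. This is a minimal standalone illustration of the behavior described above (midnight values stay unchanged, all others round up to the next midnight), not the actual DQX generator code.

  ```python
  from datetime import datetime, time, timedelta

  def round_up_to_midnight(dt: datetime) -> datetime:
      """Round a datetime up to midnight, leaving exact midnights unchanged."""
      if dt.time() == time.min:
          # Already at midnight: keep as-is (the fixed behavior).
          return dt
      # Otherwise round up to the next day's midnight.
      return datetime.combine(dt.date() + timedelta(days=1), time.min)

  # A value already at midnight stays unchanged:
  print(round_up_to_midnight(datetime(2024, 5, 1)))          # 2024-05-01 00:00:00
  # Any other value rounds up to the next day's midnight:
  print(round_up_to_midnight(datetime(2024, 5, 1, 13, 37)))  # 2024-05-02 00:00:00
  ```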
- Added API docs (#520). The DQX API documentation is generated automatically from docstrings. As part of this change, the library's docstrings have been updated to follow Google style.
- Improved test automation by adding end-to-end test for the asset bundles demo (#533).
BREAKING CHANGES!
`ExtraParams` was moved from the `databricks.labs.dqx.rule` module to `databricks.labs.dqx.config`
Contributors: @mwojtyczka @ghanse @renardeinside @cornzyblack @bsr-the-mngrm @dinbab1984 @AdityaMandiwal