
Commit ee802c4

Release v0.9.1 (#546)
The release replaces v0.9.0, which was missing the PyPI package and will be removed.

## 0.9.1

* Added quality checker and end-to-end workflows ([#519](#519)). This release introduces a no-code solution for applying checks. Two workflows were added: quality-checker, which applies checks and saves results to tables, and end-to-end (e2e), which profiles input data, generates quality checks, applies the checks, and saves results to tables. The workflows enable quality checking for data at rest without the need for code-level integration. They support reference data for checks using tables (e.g., as required by foreign key or compare datasets checks) as well as custom Python check functions (a mapping of each custom check function to the module path in the workspace or Unity Catalog volume containing the function definition). The workflows handle one run config per job run; a future release will introduce functionality to execute this across multiple tables. In addition, CLI commands have been added to execute the workflows. Additionally, DQX workflows are now configured to execute on serverless clusters, with an option to use standard clusters as well. `InstallationChecksStorageHandler` now supports absolute workspace path locations.
* Added built-in row-level check for PII detection ([#486](#486)). Introduced a new built-in check for Personally Identifiable Information (PII) detection, which utilizes the Presidio framework and can be configured using various parameters, such as the NLP entity recognition configuration. This check can be defined using the `does_not_contain_pii` check function and can be customized to suit specific use cases. The check requires the `pii` extras to be installed: `pip install databricks-labs-dqx[pii]`. Furthermore, a new enum class `NLPEngineConfig` has been introduced to define various NLP engine configurations for PII detection. Overall, these updates aim to provide more robust and customizable quality checking capabilities for detecting PII.
* Added equality row-level checks ([#535](#535)). Two new row-level checks, `is_equal_to` and `is_not_equal_to`, have been introduced to enable equality checks on column values, allowing users to verify whether the values in a specified column are equal to or not equal to a given value. The value can be a numeric literal, column expression, string literal, date literal, or timestamp literal (see the sketch after this list).
* Added demo for Spark Structured Streaming ([#518](#518)). Added a demo showcasing the use of DQX with Spark Structured Streaming for in-transit data quality checking. The demo is available as a Databricks notebook and can be run in any Databricks workspace.
* Added clarification to profiler summary statistics ([#523](#523)). Added a new section on understanding summary statistics, which explains how these statistics are computed on a sampled subset of the data and provides a reference for the various summary statistics fields.
* Fixed rounding datetimes in the checks generator ([#517](#517)). The generator has been enhanced to correctly handle midnight values when rounding "up", ensuring that datetime values already at midnight remain unchanged, whereas previously they were rounded to the next day.
* Added API Docs ([#520](#520)). The DQX API documentation is generated automatically from docstrings. As part of this change, the library's documentation has been updated to follow the Google style.
* Improved test automation by adding an end-to-end test for the asset bundles demo ([#533](#533)).
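As a quick, hedged illustration, the sketch below applies the new checks as metadata-defined rules. It is not taken from this commit: the argument names for `is_equal_to`, `is_not_equal_to`, and `does_not_contain_pii` are assumptions based on the descriptions above, and the table and column names are illustrative.

```python
# Hypothetical sketch: applying the new 0.9.1 checks as metadata-defined rules.
# Argument names for the new functions are assumptions based on the release
# notes, not confirmed by this commit.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

checks = [
    {   # new in 0.9.1: equality row-level check
        "criticality": "error",
        "check": {"function": "is_equal_to", "arguments": {"column": "quantity", "value": 100}},
    },
    {   # new in 0.9.1: inequality row-level check
        "criticality": "warn",
        "check": {"function": "is_not_equal_to", "arguments": {"column": "quantity", "value": 0}},
    },
    {   # new in 0.9.1: PII detection; requires `pip install databricks-labs-dqx[pii]`
        "criticality": "warn",
        "check": {"function": "does_not_contain_pii", "arguments": {"column": "comments"}},
    },
]

dq_engine = DQEngine(WorkspaceClient())
input_df = spark.read.table("main.demo.orders")  # illustrative table; `spark` as in a Databricks notebook
valid_df, quarantined_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)
```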
BREAKING CHANGES!

* `ExtraParams` was moved from the `databricks.labs.dqx.rule` module to `databricks.labs.dqx.config`
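Since `ExtraParams` only changed modules, migrating is a one-line import update:

```python
# Import change required by the 0.9.1 breaking change:
# from databricks.labs.dqx.rule import ExtraParams   # pre-0.9.1 (no longer valid)
from databricks.labs.dqx.config import ExtraParams   # 0.9.1 and later
```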
1 parent 806fca5 commit ee802c4

File tree

5 files changed: +15 −16 lines changed

CHANGELOG.md

Lines changed: 1 addition & 2 deletions
@@ -1,6 +1,6 @@
 # Version changelog
 
-## 0.9.0
+## 0.9.1
 
 * Added quality checker and end-to-end workflows ([#519](https://github.com/databrickslabs/dqx/issues/519)). This release introduces a no-code solution for applying checks. Two workflows were added: quality-checker, which applies checks and saves results to tables, and end-to-end (e2e), which profiles input data, generates quality checks, applies the checks, and saves results to tables. The workflows enable quality checking for data at rest without the need for code-level integration. They support reference data for checks using tables (e.g., as required by foreign key or compare datasets checks) as well as custom Python check functions (a mapping of each custom check function to the module path in the workspace or Unity Catalog volume containing the function definition). The workflows handle one run config per job run; a future release will introduce functionality to execute this across multiple tables. In addition, CLI commands have been added to execute the workflows. Additionally, DQX workflows are now configured to execute on serverless clusters, with an option to use standard clusters as well. `InstallationChecksStorageHandler` now supports absolute workspace path locations.
 * Added built-in row-level check for PII detection ([#486](https://github.com/databrickslabs/dqx/issues/486)). Introduced a new built-in check for Personally Identifiable Information (PII) detection, which utilizes the Presidio framework and can be configured using various parameters, such as the NLP entity recognition configuration. This check can be defined using the `does_not_contain_pii` check function and can be customized to suit specific use cases. The check requires the `pii` extras to be installed: `pip install databricks-labs-dqx[pii]`. Furthermore, a new enum class `NLPEngineConfig` has been introduced to define various NLP engine configurations for PII detection. Overall, these updates aim to provide more robust and customizable quality checking capabilities for detecting PII.
@@ -15,7 +15,6 @@ BREAKING CHANGES!
 
 * `ExtraParams` was moved from the `databricks.labs.dqx.rule` module to `databricks.labs.dqx.config`
 
-
 ## 0.8.0
 
 * Added new row-level freshness check ([#495](https://github.com/databrickslabs/dqx/issues/495)). A new data quality check function, `is_data_fresh`, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes from a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, defaulting to the current timestamp if not provided.
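The `is_data_fresh` check carried over from 0.8.0 (last line of the hunk above) could be expressed as a metadata rule roughly as follows. This is a hypothetical sketch: the argument names are assumptions inferred from the parameter descriptions, not taken from the source.

```python
# Hypothetical metadata rule for the is_data_fresh check; argument names are
# assumptions based on the 0.8.0 release notes, not confirmed by the source.
freshness_check = {
    "criticality": "warn",
    "check": {
        "function": "is_data_fresh",
        "arguments": {
            "column": "event_ts",    # timestamp column to check
            "max_age_minutes": 60,   # data older than this is considered stale
            # optional base timestamp column omitted: defaults to the current timestamp
        },
    },
}
```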

docs/dqx/docs/demos.mdx

Lines changed: 10 additions & 10 deletions
@@ -8,20 +8,20 @@ import Admonition from '@theme/Admonition';
 
 Import the following notebooks in the Databricks workspace to try DQX out:
 ## Use as Library
-* [DQX Quick Start Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
-* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
-* [DQX Demo Notebook for Spark Structured Streaming (Native End-to-End Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_streaming_demo_native.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, using the built-in end-to-end method to handle both reading and writing.
-* [DQX Demo Notebook for Spark Structured Streaming (DIY Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_streaming_demo_diy.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, while handling reading and writing on your own outside DQX using Spark API.
-* [DQX Demo Notebook for Lakeflow Pipelines (formerly DLT)](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_dlt_demo.py) - demonstrates how to use DQX as a library with Lakeflow Pipelines.
-* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
-* [DQX Demo for dbt](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.
+* [DQX Quick Start Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
+* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
+* [DQX Demo Notebook for Spark Structured Streaming (Native End-to-End Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_streaming_demo_native.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, using the built-in end-to-end method to handle both reading and writing.
+* [DQX Demo Notebook for Spark Structured Streaming (DIY Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_streaming_demo_diy.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, while handling reading and writing on your own outside DQX using Spark API.
+* [DQX Demo Notebook for Lakeflow Pipelines (formerly DLT)](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_dlt_demo.py) - demonstrates how to use DQX as a library with Lakeflow Pipelines.
+* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
+* [DQX Demo for dbt](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.
 
 ## Deploy as Workspace Tool
-* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
+* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
 
 ## Use Cases
-* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
-* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.
+* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
+* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.
 
 <br />
 <Admonition type="tip" title="Execution Environment">

docs/dqx/docs/guide/quality_checks_apply.mdx

Lines changed: 1 addition & 1 deletion
@@ -557,7 +557,7 @@ Below is a sample output of a check stored in a result column (error or warning)
 ```
 
 The structure of the result columns is an array of struct containing the following fields
-(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.9.0/src/databricks/labs/dqx/schema/dq_result_schema.py)):
+(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.9.1/src/databricks/labs/dqx/schema/dq_result_schema.py)):
 - `name`: name of the check (string type).
 - `message`: message describing the quality issue (string type).
 - `columns`: name of the column(s) where the quality issue was found (string type).
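Based on the fields documented in this hunk, a single entry in a result column would look roughly like the following. This is a hypothetical illustration: the check name and message are invented, and the full schema in `dq_result_schema.py` contains additional fields.

```python
# Hypothetical shape of one entry in a DQX result column, mirroring the three
# fields documented above; real entries follow dq_result_schema.py.
result_entry = {
    "name": "quantity_is_not_equal_to",     # name of the check
    "message": "Value '0' is not allowed",  # message describing the quality issue
    "columns": "quantity",                  # column(s) where the issue was found
}
```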

docs/dqx/docs/reference/quality_checks.mdx

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@ import TabItem from '@theme/TabItem';
 This page provides a reference for the quality check functions available in DQX.
 The terms `checks` and `rules` are used interchangeably in the documentation, as they are synonymous in DQX.
 
-You can explore the implementation details of the check functions [here](https://github.com/databrickslabs/dqx/blob/v0.9.0/src/databricks/labs/dqx/check_funcs.py).
+You can explore the implementation details of the check functions [here](https://github.com/databrickslabs/dqx/blob/v0.9.1/src/databricks/labs/dqx/check_funcs.py).
 
 ## Row-level checks reference
 
@@ -2723,7 +2723,7 @@ The built-in check supports several configurable parameters:
 DQX automatically installs the built-in entity recognition models at runtime if they are not already available.
 However, for better performance and to avoid potential out-of-memory issues, it is recommended to pre-install models using pip install.
 Any additional models used in a custom configuration must also be installed on your Databricks cluster.
-See the [Using DQX for PII Detection](https://github.com/databrickslabs/dqx/blob/v0.9.0/demos/dqx_demo_pii_detection.py) notebook for examples of custom model installation.
+See the [Using DQX for PII Detection](https://github.com/databrickslabs/dqx/blob/v0.9.1/demos/dqx_demo_pii_detection.py) notebook for examples of custom model installation.
 </Admonition>
 
 #### Using Built-in NLP Engine Configurations
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.9.0"
+__version__ = "0.9.1"
