
Commit 6432e45

Release v0.10.0 (#920)
* Added Data Quality Summary Metrics ([#553](#553)). The data quality engine has been enhanced with the ability to track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new `DQMetricsObserver` class has been introduced to manage Spark observations and track summary metrics on datasets checked with the engine. The `DQEngine` class has been updated to optionally return the Spark observation associated with a given run, allowing users to access and save summary metrics. The engine now also supports writing summary metrics to a table using the `metrics_config` parameter, and a new `save_summary_metrics` method has been added to save data quality summary metrics to a table. Additionally, the engine now includes a unique `run_id` field in the detailed per-row quality results, enabling cross-referencing with summary metrics. The configuration file has also been updated to support the storage of summary metrics. Overall, these enhancements provide a more comprehensive and flexible data quality checking capability, allowing users to track and analyze data quality issues more effectively.

* LLM-assisted rules generation ([#577](#577)). This release integrates AI-assisted rules generation using large language models (LLMs) into the data quality rules generation process. The `DQGenerator` class now includes a `generate_dq_rules_ai_assisted` method, which takes user input in natural language and, optionally, a schema from an input table to generate data quality rules. These rules are then validated for correctness. The AI-assisted rules generation feature supports both programmatic and no-code approaches, enables the use of different LLM models, and allows the use of custom check functions. The release also includes updates to the documentation, configuration files, and testing framework to support the new feature, ensuring a more streamlined and efficient process for defining and applying data quality rules.

* Added Lakebase checks storage backend ([#550](#550)). A Lakebase checks storage backend was added, allowing users to store and manage their data quality rules in a centralized Lakebase table, in addition to the existing Delta table storage. The `checks_location` resolution has been updated to accommodate Lakebase, supporting both table and file storage with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The Lakebase checks storage backend is configurable through the `LakebaseChecksStorageConfig` class, which includes fields for instance name, user, location, port, run configuration name, and write mode. This update gives users more flexibility in storing and loading quality checks, ensuring that checks are saved correctly regardless of the specified location format.

* Added runtime validation of SQL expressions ([#625](#625)). The data quality check functionality now validates SQL expressions at runtime, ensuring that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message. Additionally, the configuration validation for Unity Catalog volume file paths now enforces a specific format, preventing invalid configurations and providing more informative error messages.

* Fixed docs ([#598](#598)). The documentation build process has been reworked to improve efficiency and maintainability.

* Improved Config Serialization ([#676](#676)). Several updates improve the functionality, consistency, and maintainability of the codebase. The configuration loading functionality has been refactored to use the `ConfigSerializer` class, which handles the serialization and deserialization of workspace and run configurations.

* Restore use of `hatch-fancy-pypi-readme` to fix images in PyPi ([#601](#601)). The image source path for the logo in the README has been modified so that the logo displays correctly when rendered, particularly on PyPI.

* Skip check evaluation if columns or filter cannot be resolved in the input DataFrame ([#609](#609)). DQX now skips check evaluation if columns or filters are incorrect, allowing other checks to proceed even if one rule fails. The DQX engine validates the specified `column`, `columns`, and `filter` fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.

* Updated user guide docs ([#607](#607)). The documentation for quality checking and integration options has been updated to provide accurate and detailed information on supported types and approaches. Quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.

* Improved build process ([#618](#618)). The hatch version has been updated to 1.15.0 to avoid compatibility issues with click version 8.3 and later, which introduced a bug affecting hatch. The project's dependencies have also been updated, including bumping `databricks-labs-pytester` from 0.7.2 to 0.7.4, and integration tests were refactored to use a single Lakebase instance, with retry logic added to handle cases where the workspace quota for Lakebase instances is exceeded, improving test reliability. Furthermore, documentation updates clarify how quality checks are applied to data using DQX. These changes improve the efficiency, reliability, and clarity of the project's testing and documentation infrastructure.
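The `run_id` cross-referencing described in the summary-metrics entry can be sketched with a small toy model. All class and function names below (`RowResult`, `SummaryMetrics`, `summarize`) are hypothetical illustrations, not DQX's actual classes: the idea is only that per-row results and the run's summary metrics carry the same `run_id`, so metrics saved to a table can later be joined back to the detailed per-row results.

```python
# Toy illustration (hypothetical names, not the DQX API): a shared run_id
# links detailed per-row quality results to run-level summary metrics.
import uuid
from dataclasses import dataclass


@dataclass
class RowResult:
    run_id: str      # same value for every row checked in one engine run
    row_key: str
    check_name: str
    passed: bool


@dataclass
class SummaryMetrics:
    run_id: str      # matches RowResult.run_id, enabling a join between tables
    input_row_count: int
    error_row_count: int


def summarize(run_id: str, results: list[RowResult]) -> SummaryMetrics:
    """Aggregate per-row results into summary metrics keyed by the same run_id."""
    rows = {r.row_key for r in results}
    failed = {r.row_key for r in results if not r.passed}
    return SummaryMetrics(run_id, len(rows), len(failed))


run_id = str(uuid.uuid4())
results = [
    RowResult(run_id, "row-1", "is_not_null", True),
    RowResult(run_id, "row-2", "is_not_null", False),
]
metrics = summarize(run_id, results)
```

In DQX itself the per-row results are columns on the checked DataFrame and the summary metrics come from a Spark `Observation`; this sketch only shows the linking idea.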
1 parent 07d695a · commit 6432e45

File tree

7 files changed (+35 −17 lines changed)


CHANGELOG.md

Lines changed: 13 additions & 0 deletions
Original file line number | Diff line number | Diff line change
@@ -1,5 +1,18 @@
 # Version changelog
 
+## 0.10.0
+
+* Added Data Quality Summary Metrics ([#553](https://github.com/databrickslabs/dqx/issues/553)). The data quality engine has been enhanced with the ability to track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new `DQMetricsObserver` class has been introduced to manage Spark observations and track summary metrics on datasets checked with the engine. The `DQEngine` class has been updated to optionally return the Spark observation associated with a given run, allowing users to access and save summary metrics. The engine now also supports writing summary metrics to a table using the `metrics_config` parameter, and a new `save_summary_metrics` method has been added to save data quality summary metrics to a table. Additionally, the engine has been updated to include a unique `run_id` field in the detailed per-row quality results, enabling cross-referencing with summary metrics. The changes also include updates to the configuration file to support the storage of summary metrics. Overall, these enhancements provide a more comprehensive and flexible data quality checking capability, allowing users to track and analyze data quality issues more effectively.
+* LLM-assisted rules generation ([#577](https://github.com/databrickslabs/dqx/issues/577)). This release introduces a significant enhancement to the data quality rules generation process with the integration of AI-assisted rules generation using large language models (LLMs). The `DQGenerator` class now includes a `generate_dq_rules_ai_assisted` method, which takes user input in natural language and optionally a schema from an input table to generate data quality rules. These rules are then validated for correctness. The AI-assisted rules generation feature supports both programmatic and no-code approaches. Additionally, the feature enables the use of different LLM models and allows the use of custom check functions. The release also includes various updates to the documentation, configuration files, and testing framework to support the new AI-assisted rules generation feature, ensuring a more streamlined and efficient process for defining and applying data quality rules.
+* Added Lakebase checks storage backend ([#550](https://github.com/databrickslabs/dqx/issues/550)). A Lakebase checks storage backend was added, allowing users to store and manage their data quality rules in a centralized Lakebase table, in addition to the existing Delta table storage. The `checks_location` resolution has been updated to accommodate Lakebase, supporting both table and file storage, with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The Lakebase checks storage backend is configurable through the `LakebaseChecksStorageConfig` class, which includes fields for instance name, user, location, port, run configuration name, and write mode. This update provides users with more flexibility in storing and loading quality checks, ensuring that checks are saved correctly regardless of the specified location format.
+* Added runtime validation of SQL expressions ([#625](https://github.com/databrickslabs/dqx/issues/625)). The data quality check functionality has been enhanced with runtime validation of SQL expressions, ensuring that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message. Additionally, the configuration validation for Unity Catalog volume file paths has been improved to enforce a specific format, preventing invalid configurations and providing more informative error messages.
+* Fixed docs ([#598](https://github.com/databrickslabs/dqx/issues/598)). The documentation build process has undergone significant improvements to enhance efficiency and maintainability.
+* Improved Config Serialization ([#676](https://github.com/databrickslabs/dqx/issues/676)). Several updates have been made to improve the functionality, consistency, and maintainability of the codebase. The configuration loading functionality has been refactored to utilize the `ConfigSerializer` class, which handles the serialization and deserialization of workspace and run configurations.
+* Restore use of `hatch-fancy-pypi-readme` to fix images in PyPi ([#601](https://github.com/databrickslabs/dqx/issues/601)). The image source path for the logo in the README has been modified to correctly display the logo image when rendered, particularly on PyPI.
+* Skip check evaluation if columns or filter cannot be resolved in the input DataFrame ([#609](https://github.com/databrickslabs/dqx/issues/609)). DQX now skips check evaluation if columns or filters are incorrect, allowing other checks to proceed even if one rule fails. The DQX engine validates the specified column, columns, and filter fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.
+* Updated user guide docs ([#607](https://github.com/databrickslabs/dqx/issues/607)). The documentation for quality checking and integration options has been updated to provide accurate and detailed information on supported types and approaches. Quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.
+* Improved build process ([#618](https://github.com/databrickslabs/dqx/issues/618)). The hatch version has been updated to 1.15.0 to avoid compatibility issues with click version 8.3 and later, which introduced a bug affecting hatch. Additionally, the project's dependencies have been updated, including bumping the `databricks-labs-pytester` version from 0.7.2 to 0.7.4, and code refactoring has been done to use a single Lakebase instance for all integration tests, with retry logic added to handle cases where the workspace quota limit for the number of Lakebase instances is exceeded, enhancing the testing infrastructure and improving test reliability. Furthermore, documentation updates have been made to clarify the application of quality checks to data using DQX. These changes aim to improve the efficiency, reliability, and clarity of the project's testing and documentation infrastructure.
+
 ## 0.9.3
 
 * Added support for running checks on multiple tables ([#566](https://github.com/databrickslabs/dqx/issues/566)). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs or for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve scalability of DQX.

docs/dqx/docs/demos.mdx

Lines changed: 12 additions & 12 deletions
@@ -8,22 +8,22 @@ import Admonition from '@theme/Admonition';
 
 Import the following notebooks in the Databricks workspace to try DQX out:
 ## Use as Library
-* [DQX Quick Start Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
-* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
-* [DQX Demo Notebook for Data Quality Summary Metrics](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_summary_metrics.py) - demonstrates how to generate summary-level data quality metrics when validating data with DQX.
-* [DQX Demo Notebook for Profiling and Applying Checks at Scale on Multiple Tables](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_multi_table_demo.py) - demonstrates how to use DQX as a library at scale to apply checks on multiple tables.
-* [DQX Demo Notebook for Spark Structured Streaming (Native End-to-End Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_streaming_demo_native.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, using the built-in end-to-end method to handle both reading and writing.
-* [DQX Demo Notebook for Spark Structured Streaming (DIY Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_streaming_demo_diy.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, while handling reading and writing on your own outside DQX using Spark API.
-* [DQX Demo Notebook for Lakeflow Pipelines (formerly DLT)](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_dlt_demo.py) - demonstrates how to use DQX as a library with Lakeflow Pipelines.
-* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
-* [DQX Demo for dbt](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.
+* [DQX Quick Start Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
+* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
+* [DQX Demo Notebook for Data Quality Summary Metrics](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_demo_summary_metrics.py) - demonstrates how to generate summary-level data quality metrics when validating data with DQX.
+* [DQX Demo Notebook for Profiling and Applying Checks at Scale on Multiple Tables](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_multi_table_demo.py) - demonstrates how to use DQX as a library at scale to apply checks on multiple tables.
+* [DQX Demo Notebook for Spark Structured Streaming (Native End-to-End Approach)](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_streaming_demo_native.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, using the built-in end-to-end method to handle both reading and writing.
+* [DQX Demo Notebook for Spark Structured Streaming (DIY Approach)](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_streaming_demo_diy.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, while handling reading and writing on your own outside DQX using Spark API.
+* [DQX Demo Notebook for Lakeflow Pipelines (formerly DLT)](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_dlt_demo.py) - demonstrates how to use DQX as a library with Lakeflow Pipelines.
+* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
+* [DQX Demo for dbt](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.
 
 ## Deploy as Workspace Tool
-* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
+* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
 
 ## Use Cases
-* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
-* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.
+* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
+* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.
 
 <br />
 <Admonition type="tip" title="Execution Environment">

docs/dqx/docs/dev/contributing.mdx

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ In all cases, the full test suite is executed on code merged into `main` branch
 
 This section provides a step-by-step guide for setting up your local environment and dependencies for efficient development.
 
-To start with, install the required python version on your computer (see `python=<version>` field in [pyproject.toml](https://github.com/databrickslabs/dqx/blob/v0.9.3/pyproject.toml)).
+To start with, install the required python version on your computer (see `python=<version>` field in [pyproject.toml](https://github.com/databrickslabs/dqx/blob/v0.10.0/pyproject.toml)).
 On MacOSX, run:
 ```
 # double check the required version in pyproject.toml

docs/dqx/docs/guide/quality_checks_apply.mdx

Lines changed: 1 addition & 1 deletion
@@ -892,7 +892,7 @@ Below is a sample output of a check stored in a result column (error or warning)
 ```
 
 The structure of the result columns is an array of struct containing the following fields
-(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.9.3/src/databricks/labs/dqx/schema/dq_result_schema.py)):
+(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.10.0/src/databricks/labs/dqx/schema/dq_result_schema.py)):
 - `name`: name of the check (string type).
 - `message`: message describing the quality issue (string type).
 - `columns`: name of the column(s) where the quality issue was found (string type).

docs/dqx/docs/reference/quality_checks.mdx

Lines changed: 2 additions & 2 deletions
@@ -16,7 +16,7 @@ DQX provides a collection of predefined built-in quality rules (checks). Additio
 All DQX rules are evaluated independently and in parallel. If one rule fails, the others are still processed.
 All rule types, including row-level and dataset-level rules, can be defined and applied simultaneously in a single execution.
 
-You can explore the implementation details of the check functions [here](https://github.com/databrickslabs/dqx/blob/v0.9.3/src/databricks/labs/dqx/check_funcs.py).
+You can explore the implementation details of the check functions [here](https://github.com/databrickslabs/dqx/blob/v0.10.0/src/databricks/labs/dqx/check_funcs.py).
 
 ## Row-level checks reference
 
@@ -3077,7 +3077,7 @@ The built-in check supports several configurable parameters:
 DQX automatically installs the built-in entity recognition models at runtime if they are not already available.
 However, for better performance and to avoid potential out-of-memory issues, it is recommended to pre-install models using pip install.
 Any additional models used in a custom configuration must also be installed on your Databricks cluster.
-See the [Using DQX for PII Detection](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_pii_detection.py) notebook for examples of custom model installation.
+See the [Using DQX for PII Detection](https://github.com/databrickslabs/dqx/blob/v0.10.0/demos/dqx_demo_pii_detection.py) notebook for examples of custom model installation.
 </Admonition>
 
 #### Using Built-in NLP Engine Configurations

pyproject.toml

Lines changed: 5 additions & 0 deletions
@@ -6,6 +6,11 @@ description = 'Data Quality eXtended (DQX) is a Python library for data quality
 license-files = { paths = ["LICENSE", "NOTICE"] }
 requires-python = ">=3.10"
 keywords = ["Databricks"]
+authors = [
+    {name = "Databricks Labs", email = "[email protected]"},
+    { name = "Marcin Wojtyczka", email = "[email protected]" },
+    { name = "Alex Ott", email = "[email protected]" },
+]
 maintainers = [
     { name = "Databricks Labs", email = "[email protected]" },
     { name = "Marcin Wojtyczka", email = "[email protected]" },

Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.9.3"
+__version__ = "0.10.0"
