
Commit d9d47a9

Release v0.9.3 (#597)
* Added support for running checks on multiple tables ([#566](#566)). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs, for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.
* Added New Row-level Checks: IPv6 Address Validation ([#578](#578)). DQX now includes two new row-level checks: validation of IPv6 addresses (`is_valid_ipv6_address` check function) and validation that an IPv6 address is within a provided CIDR block (`is_ipv6_address_in_cidr` check function). A hedged configuration sketch covering these and other new checks appears after this list.
* Added New Dataset-level Check: Schema Validation check ([#568](#568)). The `has_valid_schema` check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a `StructType` object, and optional arguments to specify columns to validate and strict mode.
* Added New Row-level Checks: Spatial data validations ([#581](#581)). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values; various geometry and geography types such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons; as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`. These additions allow more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless compute or a cluster with runtime 17.1 or above.
* Added absolute and relative tolerance to comparison of datasets ([#574](#574)). The `compare_datasets` check has been enhanced with absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
* Added detailed telemetry ([#561](#561)). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help focus development efforts on the areas that matter most to users.
* Allow installation in a custom folder ([#575](#575)). The installation process has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on a [group assigned cluster](https://docs.databricks.com/aws/en/compute/group-access).
* Profile subset dataframe ([#589](#589)). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option: a string SQL expression used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules focused on particular segments of the data, such as rows that match certain conditions. A hedged profiling sketch appears after this list.
* Added custom exceptions ([#582](#582)). The codebase now uses custom exceptions to handle various error scenarios, providing more specific and informative error messages than generic exceptions.

BREAKING CHANGES!

* Workflows run by default for all run configs from the configuration file. Previously, the default behaviour was to run them for a specific run config only.
* The following deprecated methods have been removed from the `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_table`, `load_checks_from_installation`, `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_table`, `save_checks_in_installation`, `load_run_config`. For loading and saving checks, users are advised to use the `load_checks` and `save_checks` methods of the `DQEngine` described [here](https://databrickslabs.github.io/dqx/docs/guide/quality_checks_storage/), which support various storage types. A hedged migration sketch appears after this list.
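A minimal sketch of how the new check functions in this release might be declared using DQX's metadata (dict/YAML) check format. The `criticality`/`check`/`function`/`arguments` structure follows the DQX documentation, but the argument names for the new functions (`cidr_block`, `expected_schema`, `strict`) are assumptions based on existing checks, not confirmed API; verify against the 0.9.3 quality checks reference.

```python
# Hypothetical sketch only: argument names for the new 0.9.3 functions are assumed.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine

checks = [
    # Row-level: value must be a valid IPv6 address.
    {
        "criticality": "error",
        "check": {"function": "is_valid_ipv6_address", "arguments": {"column": "client_ip"}},
    },
    # Row-level: IPv6 address must fall inside a CIDR block ("cidr_block" name assumed).
    {
        "criticality": "warn",
        "check": {
            "function": "is_ipv6_address_in_cidr",
            "arguments": {"column": "client_ip", "cidr_block": "2001:db8::/32"},
        },
    },
    # Row-level geospatial checks (require serverless compute or Databricks Runtime 17.1+).
    {"criticality": "error", "check": {"function": "is_latitude", "arguments": {"column": "lat"}}},
    {"criticality": "error", "check": {"function": "is_longitude", "arguments": {"column": "lon"}}},
    # Dataset-level schema validation ("expected_schema"/"strict" names assumed).
    {
        "criticality": "error",
        "check": {
            "function": "has_valid_schema",
            "arguments": {
                "expected_schema": "client_ip STRING, lat DOUBLE, lon DOUBLE",
                "strict": False,
            },
        },
    },
]

dq_engine = DQEngine(WorkspaceClient())
status = DQEngine.validate_checks(checks)  # surfaces malformed definitions before applying
# annotated_df = dq_engine.apply_checks_by_metadata(input_df, checks)
```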
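For the new profiler filter option, a minimal sketch assuming the filter is passed as a SQL expression string through the profiler options; the option key used here (`filter`) and the hypothetical table name are assumptions to confirm against the profiling documentation.

```python
# Hypothetical sketch only: the "filter" option key and table name are assumptions.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator

ws = WorkspaceClient()
input_df = spark.read.table("main.examples.orders")  # assumes a notebook-provided `spark` session

profiler = DQProfiler(ws)
summary_stats, profiles = profiler.profile(
    input_df,
    options={
        "sample_fraction": 0.3,       # profile a sample of the data
        "filter": "region = 'EMEA'",  # new in 0.9.3: SQL expression selecting the subset to profile
    },
)

# Generate candidate quality rules from the profiles of the filtered subset.
generator = DQGenerator(ws)
candidate_checks = generator.generate_dq_rules(profiles)
```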
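For the breaking change around check storage, migration from the removed `load_checks_from_*` / `save_checks_in_*` methods might look roughly like the following. The storage config class names and their `location` parameter are assumptions taken from the quality checks storage guide linked above and should be verified there.

```python
# Hypothetical migration sketch only: storage config class names and parameters are assumed.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import (
    TableChecksStorageConfig,
    WorkspaceFileChecksStorageConfig,
)

dq_engine = DQEngine(WorkspaceClient())

# Previously: dq_engine.load_checks_from_workspace_file(...)  (removed in 0.9.3)
checks = dq_engine.load_checks(
    config=WorkspaceFileChecksStorageConfig(location="/Shared/dqx/checks.yml")
)

# Previously: dq_engine.save_checks_in_table(...)  (removed in 0.9.3)
dq_engine.save_checks(
    checks,
    config=TableChecksStorageConfig(location="main.dqx.checks"),
)
```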
1 parent 8541e95 commit d9d47a9

File tree

6 files changed: +33 / -16 lines changed

CHANGELOG.md

Lines changed: 17 additions & 0 deletions
@@ -1,5 +1,22 @@
 # Version changelog
 
+## 0.9.3
+
+* Added support for running checks on multiple tables ([#566](https://github.com/databrickslabs/dqx/issues/566)). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs, for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.
+* Added New Row-level Checks: IPv6 Address Validation ([#578](https://github.com/databrickslabs/dqx/issues/578)). DQX now includes two new row-level checks: validation of IPv6 addresses (`is_valid_ipv6_address` check function) and validation that an IPv6 address is within a provided CIDR block (`is_ipv6_address_in_cidr` check function).
+* Added New Dataset-level Check: Schema Validation check ([#568](https://github.com/databrickslabs/dqx/issues/568)). The `has_valid_schema` check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a `StructType` object, and optional arguments to specify columns to validate and strict mode.
+* Added New Row-level Checks: Spatial data validations ([#581](https://github.com/databrickslabs/dqx/issues/581)). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values; various geometry and geography types such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons; as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`. These additions allow more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless compute or a cluster with runtime 17.1 or above.
+* Added absolute and relative tolerance to comparison of datasets ([#574](https://github.com/databrickslabs/dqx/issues/574)). The `compare_datasets` check has been enhanced with absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
+* Added detailed telemetry ([#561](https://github.com/databrickslabs/dqx/issues/561)). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help focus development efforts on the areas that matter most to users.
+* Allow installation in a custom folder ([#575](https://github.com/databrickslabs/dqx/issues/575)). The installation process has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on a [group assigned cluster](https://docs.databricks.com/aws/en/compute/group-access).
+* Profile subset dataframe ([#589](https://github.com/databrickslabs/dqx/issues/589)). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option: a string SQL expression used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules focused on particular segments of the data, such as rows that match certain conditions.
+* Added custom exceptions ([#582](https://github.com/databrickslabs/dqx/issues/582)). The codebase now uses custom exceptions to handle various error scenarios, providing more specific and informative error messages than generic exceptions.
+
+BREAKING CHANGES!
+
+* Workflows run by default for all run configs from the configuration file. Previously, the default behaviour was to run them for a specific run config only.
+* The following deprecated methods have been removed from the `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_table`, `load_checks_from_installation`, `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_table`, `save_checks_in_installation`, `load_run_config`. For loading and saving checks, users are advised to use the `load_checks` and `save_checks` methods of the `DQEngine` described [here](https://databrickslabs.github.io/dqx/docs/guide/quality_checks_storage/), which support various storage types.
+
 ## 0.9.2
 
 * Added performance benchmarks ([#548](https://github.com/databrickslabs/dqx/issues/548)). Performance tests are run to ensure performance does not degrade by more than 25% with any change. Benchmark results are published in the documentation in the reference section. The benchmark covers all check functions, running all functions at once and applying the same functions at once to multiple columns using the for-each-column capability. A new performance GitHub workflow has been introduced to automate performance benchmarking, generating a new benchmark baseline, updating the existing baseline, and running performance tests to compare with the baseline.

docs/dqx/docs/demos.mdx

Lines changed: 11 additions & 11 deletions
@@ -8,21 +8,21 @@ import Admonition from '@theme/Admonition';
 
 Import the following notebooks in the Databricks workspace to try DQX out:
 ## Use as Library
-* [DQX Quick Start Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
-* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
-* [DQX Demo Notebook for Profiling and Applying Checks at Scale on Multiple Tables](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_multi_table_demo.py) - demonstrates how to use DQX as a library at scale to apply checks on multiple tables.
-* [DQX Demo Notebook for Spark Structured Streaming (Native End-to-End Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_streaming_demo_native.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, using the built-in end-to-end method to handle both reading and writing.
-* [DQX Demo Notebook for Spark Structured Streaming (DIY Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_streaming_demo_diy.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, while handling reading and writing on your own outside DQX using Spark API.
-* [DQX Demo Notebook for Lakeflow Pipelines (formerly DLT)](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_dlt_demo.py) - demonstrates how to use DQX as a library with Lakeflow Pipelines.
-* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
-* [DQX Demo for dbt](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.
+* [DQX Quick Start Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
+* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
+* [DQX Demo Notebook for Profiling and Applying Checks at Scale on Multiple Tables](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_multi_table_demo.py) - demonstrates how to use DQX as a library at scale to apply checks on multiple tables.
+* [DQX Demo Notebook for Spark Structured Streaming (Native End-to-End Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_streaming_demo_native.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, using the built-in end-to-end method to handle both reading and writing.
+* [DQX Demo Notebook for Spark Structured Streaming (DIY Approach)](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_streaming_demo_diy.py) - demonstrates how to use DQX as a library with Spark Structured Streaming, while handling reading and writing on your own outside DQX using Spark API.
+* [DQX Demo Notebook for Lakeflow Pipelines (formerly DLT)](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_dlt_demo.py) - demonstrates how to use DQX as a library with Lakeflow Pipelines.
+* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
+* [DQX Demo for dbt](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.
 
 ## Deploy as Workspace Tool
-* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
+* [DQX Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
 
 ## Use Cases
-* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
-* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.
+* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
+* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.
 
 <br />
 <Admonition type="tip" title="Execution Environment">

docs/dqx/docs/dev/contributing.mdx

Lines changed: 1 addition & 1 deletion
@@ -114,7 +114,7 @@ In all cases, the full test suite is executed on code merged into `main` branch
 
 This section provides a step-by-step guide for setting up your local environment and dependencies for efficient development.
 
-To start with, install the required python version on your computer (see `python=<version>` field in [pyproject.toml](https://github.com/databrickslabs/dqx/blob/v0.9.2/pyproject.toml)).
+To start with, install the required python version on your computer (see `python=<version>` field in [pyproject.toml](https://github.com/databrickslabs/dqx/blob/v0.9.3/pyproject.toml)).
 On MacOSX, run:
 ```
 # double check the required version in pyproject.toml

docs/dqx/docs/guide/quality_checks_apply.mdx

Lines changed: 1 addition & 1 deletion
@@ -800,7 +800,7 @@ Below is a sample output of a check stored in a result column (error or warning)
 ```
 
 The structure of the result columns is an array of struct containing the following fields
-(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.9.2/src/databricks/labs/dqx/schema/dq_result_schema.py)):
+(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.9.3/src/databricks/labs/dqx/schema/dq_result_schema.py)):
 - `name`: name of the check (string type).
 - `message`: message describing the quality issue (string type).
 - `columns`: name of the column(s) where the quality issue was found (string type).

docs/dqx/docs/reference/quality_checks.mdx

Lines changed: 2 additions & 2 deletions
@@ -11,7 +11,7 @@ import TabItem from '@theme/TabItem';
 This page provides a reference for the quality check functions available in DQX.
 The terms `checks` and `rules` are used interchangeably in the documentation, as they are synonymous in DQX.
 
-You can explore the implementation details of the check functions [here](https://github.com/databrickslabs/dqx/blob/v0.9.2/src/databricks/labs/dqx/check_funcs.py).
+You can explore the implementation details of the check functions [here](https://github.com/databrickslabs/dqx/blob/v0.9.3/src/databricks/labs/dqx/check_funcs.py).
 
 ## Row-level checks reference
 
@@ -3071,7 +3071,7 @@ The built-in check supports several configurable parameters:
 DQX automatically installs the built-in entity recognition models at runtime if they are not already available.
 However, for better performance and to avoid potential out-of-memory issues, it is recommended to pre-install models using pip install.
 Any additional models used in a custom configuration must also be installed on your Databricks cluster.
-See the [Using DQX for PII Detection](https://github.com/databrickslabs/dqx/blob/v0.9.2/demos/dqx_demo_pii_detection.py) notebook for examples of custom model installation.
+See the [Using DQX for PII Detection](https://github.com/databrickslabs/dqx/blob/v0.9.3/demos/dqx_demo_pii_detection.py) notebook for examples of custom model installation.
 </Admonition>
 
 #### Using Built-in NLP Engine Configurations
Lines changed: 1 addition & 1 deletion
@@ -1 +1 @@
-__version__ = "0.9.2"
+__version__ = "0.9.3"

0 commit comments
