
Commit 6607b36

Release v0.7.1 (#471)
* Added type validation for apply checks method ([#465](#465)). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of `DQRule`. If invalid types are encountered, a `TypeError` is raised with a descriptive error message suggesting alternative methods for passing checks as dictionaries. Additionally, input attribute validation has been enhanced to verify the criticality value, which must be either `warn` or `error`; a `ValueError` is raised for invalid values.
* Databricks Asset Bundle (DAB) demo ([#443](#443)). A new demo showcasing the usage of DQX with DAB has been added.
* Check to compare datasets ([#463](#463)). A new dataset-level check, `compare_datasets`, has been introduced to compare two DataFrames at both row and column levels, providing detailed information about differences, including new or missing rows and column-level changes. This check compares only columns present in both DataFrames, excludes map type columns, and can be customized to exclude specific columns or perform a FULL OUTER JOIN to identify missing records. The `compare_datasets` check can be used with a reference DataFrame or table name, and its results include information about missing and extra rows, as well as a map of changed columns and their differences.
* Demo on how to use DQX with dbt projects ([#460](#460)). A new demo has been added showcasing how to use DQX with the dbt transformation framework.
* IPv4 address validation ([#464](#464)). The library has been enhanced with new checks to validate IPv4 addresses. Two new row checks, `is_valid_ipv4_address` and `is_ipv4_address_in_cidr`, have been introduced to verify whether values in a specified column are valid IPv4 addresses and whether they fall within a given CIDR block, respectively.
* Improved loading checks from Delta table ([#462](#462)). Loading checks from Delta tables has been improved to eliminate the need to escape string arguments, providing a more robust and user-friendly experience for working with quality checks defined in Delta tables.
1 parent 2bb47af commit 6607b36

7 files changed: +125 -16

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,5 +1,14 @@
# Version changelog

## 0.7.1

* Added type validation for apply checks method ([#465](https://github.com/databrickslabs/dqx/issues/465)). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of `DQRule`. If invalid types are encountered, a `TypeError` is raised with a descriptive error message suggesting alternative methods for passing checks as dictionaries. Additionally, input attribute validation has been enhanced to verify the criticality value, which must be either `warn` or `error`; a `ValueError` is raised for invalid values.
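  A minimal sketch of what the stricter validation means in practice (illustrative only; the dict format follows DQX's metadata convention, and `spark` is assumed to be a notebook-provided session):

  ```python
  from databricks.labs.dqx.engine import DQEngine
  from databricks.sdk import WorkspaceClient

  dq_engine = DQEngine(WorkspaceClient())
  input_df = spark.createDataFrame([(1,), (None,)], "id: int")  # assumes a Databricks notebook where `spark` exists

  # Checks defined as plain dictionaries are valid metadata, but not DQRule instances.
  checks_as_dicts = [
      {"criticality": "error", "check": {"function": "is_not_null", "arguments": {"column": "id"}}}
  ]

  try:
      # apply_checks expects DQRule instances; dicts now fail fast with a TypeError
      dq_engine.apply_checks(input_df, checks_as_dicts)
  except TypeError as e:
      print(e)  # the message points to the metadata-based methods instead

  # The supported path for dictionary-defined checks:
  valid_df = dq_engine.apply_checks_by_metadata(input_df, checks_as_dicts)
  ```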
* Databricks Asset Bundle (DAB) demo ([#443](https://github.com/databrickslabs/dqx/issues/443)). A new demo showcasing the usage of DQX with DAB has been added.
* Check to compare datasets ([#463](https://github.com/databrickslabs/dqx/issues/463)). A new dataset-level check, `compare_datasets`, has been introduced to compare two DataFrames at both row and column levels, providing detailed information about differences, including new or missing rows and column-level changes. This check compares only columns present in both DataFrames, excludes map type columns, and can be customized to exclude specific columns or perform a FULL OUTER JOIN to identify missing records. The `compare_datasets` check can be used with a reference DataFrame or table name, and its results include information about missing and extra rows, as well as a map of changed columns and their differences.
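  As a sketch of the table-name variant (the demo notebook in this commit shows the reference-DataFrame form; the `ref_table` argument name and the table names below are assumptions for illustration, not confirmed signatures):

  ```python
  from databricks.labs.dqx import check_funcs
  from databricks.labs.dqx.engine import DQEngine
  from databricks.labs.dqx.rule import DQDatasetRule
  from databricks.sdk import WorkspaceClient

  checks = [
      DQDatasetRule(
          criticality="warn",
          check_func=check_funcs.compare_datasets,
          columns=["id"],  # key columns in the checked DataFrame
          check_func_kwargs={
              "ref_columns": ["id"],           # key columns in the reference dataset
              "ref_table": "main.ref.orders",  # assumed argument name for referencing a table
              "check_missing_records": True,   # FULL OUTER JOIN to also report missing rows
          },
      )
  ]

  dq_engine = DQEngine(WorkspaceClient())
  checked_df = dq_engine.apply_checks(spark.table("main.src.orders"), checks)
  ```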
* Demo on how to use DQX with dbt projects ([#460](https://github.com/databrickslabs/dqx/issues/460)). A new demo has been added showcasing how to use DQX with the dbt transformation framework.
* IPv4 address validation ([#464](https://github.com/databrickslabs/dqx/issues/464)). The library has been enhanced with new checks to validate IPv4 addresses. Two new row checks, `is_valid_ipv4_address` and `is_ipv4_address_in_cidr`, have been introduced to verify whether values in a specified column are valid IPv4 addresses and whether they fall within a given CIDR block, respectively.
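  A small illustrative sketch using the metadata format (the column name and CIDR value are placeholders, and the `cidr_block` argument name should be treated as indicative rather than authoritative):

  ```python
  from databricks.labs.dqx.engine import DQEngine
  from databricks.sdk import WorkspaceClient

  ip_checks = [
      {
          "criticality": "error",
          "check": {"function": "is_valid_ipv4_address", "arguments": {"column": "client_ip"}},
      },
      {
          "criticality": "warn",
          "check": {
              "function": "is_ipv4_address_in_cidr",
              "arguments": {"column": "client_ip", "cidr_block": "10.0.0.0/8"},  # flag IPs outside this block
          },
      },
  ]

  dq_engine = DQEngine(WorkspaceClient())
  df = spark.createDataFrame([("10.1.2.3",), ("not-an-ip",)], "client_ip: string")
  checked_df = dq_engine.apply_checks_by_metadata(df, ip_checks)  # rows gain _errors/_warnings entries
  ```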
* Improved loading checks from Delta table ([#462](https://github.com/databrickslabs/dqx/issues/462)). Loading checks from Delta tables has been improved to eliminate the need to escape string arguments, providing a more robust and user-friendly experience for working with quality checks defined in Delta tables.
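  A hedged sketch of the round trip (the loader method name is assumed from the engine's storage API and may differ; the table names are placeholders):

  ```python
  from databricks.labs.dqx.engine import DQEngine
  from databricks.sdk import WorkspaceClient

  dq_engine = DQEngine(WorkspaceClient())

  # Assumed API: load checks previously saved to a Delta table. With this release,
  # string arguments stored in the table no longer need manual escaping.
  checks = dq_engine.load_checks_from_table("main.dq.checks")

  input_df = spark.read.table("main.src.orders")  # placeholder input table
  checked_df = dq_engine.apply_checks_by_metadata(input_df, checks)
  ```
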
## 0.7.0

* Added end-to-end quality checking methods ([#364](https://github.com/databrickslabs/dqx/issues/364)). The library now includes end-to-end quality checking methods, allowing users to read data from a table or view, apply checks, and write the results to a table. The `DQEngine` class has been updated to utilize `InputConfig` and `OutputConfig` objects to handle input and output configurations, providing more flexibility in the quality checking flow. The `apply_checks_and_write_to_table` and `apply_checks_by_metadata_and_write_to_table` methods have been introduced to support this functionality, applying checks using DQX classes and configuration, respectively. Additionally, the profiler configuration options have been reorganized into `input_config` and `profiler_config` sections, making it easier to understand and customize the profiling process. The changes aim to provide a more streamlined and efficient way to perform end-to-end quality checking and data validation, with improved configuration flexibility and readability.

demos/dqx_demo_library.py

Lines changed: 91 additions & 0 deletions
@@ -1036,6 +1036,97 @@ def apply(df: DataFrame, spark: SparkSession, ref_dfs: dict[str, DataFrame]) ->

# COMMAND ----------

# MAGIC %md
# MAGIC ### Comparison of datasets
# MAGIC
# MAGIC You can use DQX to compare two datasets (DataFrames or tables) to identify differences at the row and column level. This supports use cases such as:
# MAGIC * Migration validation
# MAGIC * Drift detection
# MAGIC * Regression testing between pipeline versions
# MAGIC * Synchronization checks between source and target systems

# COMMAND ----------

from pyspark.sql import functions as F
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient


schema = "id: int, id2: int, val: string"
df = spark.createDataFrame(
    [
        [1, 1, "val1"],
        [4, 5, "val1"],
        [6, 6, "val1"],
        [None, None, None],
    ],
    schema,
)

df2 = spark.createDataFrame(
    [
        [1, 1, "val1"],
        [1, 2, "val2"],
        [5, 5, "val3"],
    ],
    schema,
)

checks = [
    DQDatasetRule(
        criticality="error",
        check_func=check_funcs.compare_datasets,
        columns=["id", "id2"],  # pass df.columns if you don't have a primary key and want to match against all columns
        check_func_kwargs={
            "ref_columns": ["id", "id2"],  # pass df2.columns if you don't have a primary key and want to match against all columns
            "ref_df_name": "ref_df",
            "check_missing_records": True,  # set to True to include information about missing records
        },
    )
]

ref_dfs = {"ref_df": df2}

dq_engine = DQEngine(WorkspaceClient())
output_df = dq_engine.apply_checks(df, checks, ref_dfs=ref_dfs)
display(output_df)

# COMMAND ----------

# MAGIC %md
# MAGIC The detailed log of differences is stored as a JSON string in the message field. It can be parsed into a structured format for easier inspection and analysis.

# COMMAND ----------

message_schema = "struct<row_missing:boolean, row_extra:boolean, changed:map<string,map<string,string>>>"

output_df = output_df.withColumn(
    "_errors",
    F.transform(
        "_errors",
        lambda x: x.withField(
            "diff_log",
            F.from_json(x["message"], message_schema),
        )
    )
).withColumn(
    "_warnings",
    F.transform(
        "_warnings",
        lambda x: x.withField(
            "diff_log",
            F.from_json(x["message"], message_schema),
        )
    )
)

display(output_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ### Applying custom column names and adding user metadata

docs/dqx/docs/demos.mdx

Lines changed: 8 additions & 8 deletions
@@ -8,18 +8,18 @@ import Admonition from '@theme/Admonition';

Import the following notebooks in the Databricks workspace to try DQX out:
## Use as Library
-* [DQX Quick Start Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
-* [DQX Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
-* [DQX Lakeflow Pipelines Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_dlt_demo.py) - demonstrates how to use DQX with Lakeflow Pipelines (formerly Delta Live Tables (DLT)).
-* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
-* [DQX Demo with dbt](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.
+* [DQX Quick Start Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
+* [DQX Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
+* [DQX Lakeflow Pipelines Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_dlt_demo.py) - demonstrates how to use DQX with Lakeflow Pipelines (formerly Delta Live Tables (DLT)).
+* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
+* [DQX Demo with dbt](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.

## Deploy as Workspace Tool
-* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
+* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.

## Use Cases
-* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
-* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.
+* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
+* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.

<br />
<Admonition type="tip" title="Execution Environment">

docs/dqx/docs/guide/quality_checks.mdx

Lines changed: 1 addition & 1 deletion
@@ -1095,7 +1095,7 @@ Below is a sample output of a check as stored in a result column:
```

The structure of the result columns is an array of struct containing the following fields
-(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.7.0/src/databricks/labs/dqx/schema/dq_result_schema.py)):
+(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.7.1/src/databricks/labs/dqx/schema/dq_result_schema.py)):
- `name`: name of the check (string type).
- `message`: message describing the quality issue (string type).
- `columns`: name of the column(s) where the quality issue was found (string type).
