# Version changelog

## 0.7.1

* Added type validation for the apply checks method
([#465](#465)). The library
now enforces stricter type validation for data quality rules, ensuring
all elements in the checks list are instances of `DQRule`. If invalid
types are encountered, a `TypeError` is raised with a descriptive error
message suggesting the alternative methods for passing checks as
dictionaries. Additionally, input attribute validation has been enhanced
to verify the criticality value, which must be either `warn` or `error`;
a `ValueError` is raised for invalid values (see the first sketch after
this list).
* Databricks Asset Bundle (DAB) demo
([#443](#443)). A new demo
showcasing the usage of DQX with DAB has been added.
* Check to compare datasets
([#463](#463)). A new
dataset-level check, `compare_datasets`, has been introduced to compare
two DataFrames at both row and column levels, providing detailed
information about differences, including new or missing rows and
column-level changes. This check compares only columns present in both
DataFrames, excludes map type columns, and can be customized to exclude
specific columns or perform a FULL OUTER JOIN to identify missing
records. The `compare_datasets` check can be used with a reference
DataFrame or table name, and its results include information about
missing and extra rows, as well as a map of changed columns and their
differences.
* Demo on how to use DQX with dbt projects
([#460](#460)). A new demo has
been added showcasing how to use DQX with the dbt transformation
framework.
* IPv4 address validation
([#464](#464)). The library
has been enhanced with new checks to validate IPv4 addresses. Two new row
checks, `is_valid_ipv4_address` and `is_ipv4_address_in_cidr`, have been
introduced to verify whether values in a specified column are valid IPv4
addresses and whether they fall within a given CIDR block, respectively
(see the second sketch after this list).
* Improved loading checks from Delta table
([#462](#462)). Loading
checks from Delta tables have been improved to eliminate the need to
escape string arguments, providing a more robust and user-friendly
experience for working with quality checks defined in Delta tables.
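
Two illustrative sketches for the items above. First, the stricter apply-checks validation; the dictionary layout shown follows common DQX metadata examples and is an assumption, not an excerpt from this release:

```python
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())
input_df = spark.range(3)  # any DataFrame with an "id" column

# Plain dicts are not DQRule instances, so passing them to apply_checks now raises
# TypeError; dictionary-based checks go through the metadata variant instead.
checks = [
    {
        "criticality": "error",  # must be "warn" or "error", otherwise ValueError
        "check": {"function": "is_not_null", "arguments": {"column": "id"}},
    }
]
valid_df = dq_engine.apply_checks_by_metadata(input_df, checks)
```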
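
Second, a hedged sketch of the new IPv4 row checks; `DQRowRule` and the `cidr_block` argument name are assumptions inferred from the library's rule classes, so verify them against the released API:

```python
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.rule import DQRowRule

ipv4_checks = [
    # Flag rows whose "ip" column does not hold a valid IPv4 address.
    DQRowRule(criticality="error", check_func=check_funcs.is_valid_ipv4_address, column="ip"),
    # Warn when the address falls outside the given CIDR block.
    DQRowRule(
        criticality="warn",
        check_func=check_funcs.is_ipv4_address_in_cidr,
        column="ip",
        check_func_kwargs={"cidr_block": "10.0.0.0/16"},
    ),
]
```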
## 0.7.0

* Added end-to-end quality checking methods ([#364](https://github.com/databrickslabs/dqx/issues/364)). The library now includes end-to-end quality checking methods, allowing users to read data from a table or view, apply checks, and write the results to a table. The `DQEngine` class has been updated to utilize `InputConfig` and `OutputConfig` objects to handle input and output configurations, providing more flexibility in the quality checking flow. The `apply_checks_and_write_to_table` and `apply_checks_by_metadata_and_write_to_table` methods have been introduced to support this functionality, applying checks using DQX classes and metadata configuration, respectively. Additionally, the profiler configuration options have been reorganized into `input_config` and `profiler_config` sections, making it easier to understand and customize the profiling process. The changes aim to provide a more streamlined and efficient way to perform end-to-end quality checking and data validation, with improved configuration flexibility and readability; a sketch follows.
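
A minimal sketch of this flow, assuming `InputConfig` and `OutputConfig` live in `databricks.labs.dqx.config` and take a table location as their first argument (both assumptions; check the released signatures, and the table names are hypothetical):

```python
from databricks.labs.dqx.config import InputConfig, OutputConfig
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

dq_engine = DQEngine(WorkspaceClient())

# Metadata-defined checks, e.g. loaded from YAML or a Delta table.
checks = [{"criticality": "error", "check": {"function": "is_not_null", "arguments": {"column": "id"}}}]

# Read the input table, apply the checks, and write the annotated result to the output table.
dq_engine.apply_checks_by_metadata_and_write_to_table(
    checks=checks,
    input_config=InputConfig("main.demo.input_table"),
    output_config=OutputConfig("main.demo.output_table"),
)
```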
# MAGIC You can use DQX to compare two datasets (DataFrames or Tables) to identify differences at the row and column level. This supports use cases such as:
# MAGIC * Migration validation
# MAGIC * Drift detection
# MAGIC * Regression testing between pipeline versions
# MAGIC * Synchronization checks between source and target systems

# COMMAND ----------

from pyspark.sql import functions as F
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient

schema = "id: int, id2: int, val: string"
df = spark.createDataFrame(
    [
        [1, 1, "val1"],
        [4, 5, "val1"],
        [6, 6, "val1"],
        [None, None, None],
    ],
    schema,
)

df2 = spark.createDataFrame(
    [
        [1, 1, "val1"],
        [1, 2, "val2"],
        [5, 5, "val3"],
    ],
    schema,
)

checks = [
    DQDatasetRule(
        criticality="error",
        check_func=check_funcs.compare_datasets,
        columns=["id", "id2"],  # pass df.columns if you don't have a primary key and want to match against all columns
        check_func_kwargs={
            "ref_columns": ["id", "id2"],  # pass df2.columns if you don't have a primary key and want to match against all columns
            "ref_df_name": "ref_df",
            "check_missing_records": True,  # set True to include info about missing records
        },
    ),
]

# Apply the checks; ref_dfs supplies the reference DataFrame under the name used in "ref_df_name".
dq_engine = DQEngine(WorkspaceClient())
result_df = dq_engine.apply_checks(df, checks, ref_dfs={"ref_df": df2})
display(result_df)

# COMMAND ----------

# MAGIC The detailed log of differences is stored as a JSON string in the message field. It can be parsed into a structured format for easier inspection and analysis.
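
# COMMAND ----------

# A minimal, illustrative way to parse such a JSON message with PySpark. The flat "message"
# column and the schema below are assumptions for illustration only, not the library's
# documented output contract; inspect result_df.printSchema() for the actual layout and adjust.
message_schema = "missing_count long, extra_count long, changed map<string, string>"
parsed_df = result_df.withColumn("message_details", F.from_json(F.col("message"), message_schema))
display(parsed_df)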
* [DQX Quick Start Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
* [DQX Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
* [DQX Lakeflow Pipelines Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_dlt_demo.py) - demonstrates how to use DQX with Lakeflow Pipelines (formerly Delta Live Tables (DLT)).
* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
* [DQX Demo with dbt](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.

## Deploy as Workspace Tool

* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.

## Use Cases

* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.