
Commit 6607b36

Release v0.7.1 (#471)
* Added type validation for apply checks method ([#465](#465)). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of `DQRule`. If invalid types are encountered, a `TypeError` is raised with a descriptive error message suggesting alternative methods for passing checks as dictionaries. Additionally, input attribute validation has been enhanced to verify the criticality value, which must be either `warn` or `error`; a `ValueError` is raised for invalid values.
* Databricks Asset Bundle (DAB) demo ([#443](#443)). A new demo showcasing the usage of DQX with DAB has been added.
* Check to compare datasets ([#463](#463)). A new dataset-level check, `compare_datasets`, has been introduced to compare two DataFrames at both row and column levels, providing detailed information about differences, including new or missing rows and column-level changes. This check compares only columns present in both DataFrames, excludes map type columns, and can be customized to exclude specific columns or perform a FULL OUTER JOIN to identify missing records. The `compare_datasets` check can be used with a reference DataFrame or table name, and its results include information about missing and extra rows, as well as a map of changed columns and their differences.
* Demo on how to use DQX with dbt projects ([#460](#460)). A new demo has been added showcasing how to use DQX with the dbt transformation framework.
* IPv4 address validation ([#464](#464)). The library has been enhanced with new checks to validate IPv4 addresses. Two new row checks, `is_valid_ipv4_address` and `is_ipv4_address_in_cidr`, have been introduced to verify whether values in a specified column are valid IPv4 addresses and whether they fall within a given CIDR block, respectively.
* Improved loading checks from Delta table ([#462](#462)). Loading checks from Delta tables has been improved to eliminate the need to escape string arguments, providing a more robust and user-friendly experience for working with quality checks defined in Delta tables.
1 parent 2bb47af commit 6607b36

7 files changed: +125 -16

CHANGELOG.md

Lines changed: 9 additions & 0 deletions
@@ -1,5 +1,14 @@
# Version changelog

## 0.7.1

* Added type validation for apply checks method ([#465](https://github.com/databrickslabs/dqx/issues/465)). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of `DQRule`. If invalid types are encountered, a `TypeError` is raised with a descriptive error message suggesting alternative methods for passing checks as dictionaries. Additionally, input attribute validation has been enhanced to verify the criticality value, which must be either `warn` or `error`; a `ValueError` is raised for invalid values.
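  A minimal sketch of what the stricter validation means in practice (illustrative only; the dict format follows DQX's metadata convention, and `spark` is assumed to be a notebook-provided session):

  ```python
  from databricks.labs.dqx.engine import DQEngine
  from databricks.sdk import WorkspaceClient

  dq_engine = DQEngine(WorkspaceClient())
  input_df = spark.createDataFrame([(1,), (None,)], "id: int")  # assumes a Databricks notebook where `spark` exists

  # Checks defined as plain dictionaries are valid metadata, but not DQRule instances.
  checks_as_dicts = [
      {"criticality": "error", "check": {"function": "is_not_null", "arguments": {"column": "id"}}}
  ]

  try:
      # apply_checks expects DQRule instances; dicts now fail fast with a TypeError
      dq_engine.apply_checks(input_df, checks_as_dicts)
  except TypeError as e:
      print(e)  # the message points to the metadata-based methods instead

  # The supported path for dictionary-defined checks:
  valid_df = dq_engine.apply_checks_by_metadata(input_df, checks_as_dicts)
  ```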
* Databricks Asset Bundle (DAB) demo ([#443](https://github.com/databrickslabs/dqx/issues/443)). A new demo showcasing the usage of DQX with DAB has been added.
* Check to compare datasets ([#463](https://github.com/databrickslabs/dqx/issues/463)). A new dataset-level check, `compare_datasets`, has been introduced to compare two DataFrames at both row and column levels, providing detailed information about differences, including new or missing rows and column-level changes. This check compares only columns present in both DataFrames, excludes map type columns, and can be customized to exclude specific columns or perform a FULL OUTER JOIN to identify missing records. The `compare_datasets` check can be used with a reference DataFrame or table name, and its results include information about missing and extra rows, as well as a map of changed columns and their differences.
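  As a sketch of the table-name variant (the demo notebook in this commit shows the reference-DataFrame form; the `ref_table` argument name and the table names below are assumptions for illustration, not confirmed signatures):

  ```python
  from databricks.labs.dqx import check_funcs
  from databricks.labs.dqx.engine import DQEngine
  from databricks.labs.dqx.rule import DQDatasetRule
  from databricks.sdk import WorkspaceClient

  checks = [
      DQDatasetRule(
          criticality="warn",
          check_func=check_funcs.compare_datasets,
          columns=["id"],  # key columns in the checked DataFrame
          check_func_kwargs={
              "ref_columns": ["id"],           # key columns in the reference dataset
              "ref_table": "main.ref.orders",  # assumed argument name for referencing a table
              "check_missing_records": True,   # FULL OUTER JOIN to also report missing rows
          },
      )
  ]

  dq_engine = DQEngine(WorkspaceClient())
  checked_df = dq_engine.apply_checks(spark.table("main.src.orders"), checks)
  ```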
* Demo on how to use DQX with dbt projects ([#460](https://github.com/databrickslabs/dqx/issues/460)). A new demo has been added showcasing how to use DQX with the dbt transformation framework.
* IPv4 address validation ([#464](https://github.com/databrickslabs/dqx/issues/464)). The library has been enhanced with new checks to validate IPv4 addresses. Two new row checks, `is_valid_ipv4_address` and `is_ipv4_address_in_cidr`, have been introduced to verify whether values in a specified column are valid IPv4 addresses and whether they fall within a given CIDR block, respectively.
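  A small illustrative sketch using the metadata format (the column name and CIDR value are placeholders, and the `cidr_block` argument name should be treated as indicative rather than authoritative):

  ```python
  from databricks.labs.dqx.engine import DQEngine
  from databricks.sdk import WorkspaceClient

  ip_checks = [
      {
          "criticality": "error",
          "check": {"function": "is_valid_ipv4_address", "arguments": {"column": "client_ip"}},
      },
      {
          "criticality": "warn",
          "check": {
              "function": "is_ipv4_address_in_cidr",
              "arguments": {"column": "client_ip", "cidr_block": "10.0.0.0/8"},  # flag IPs outside this block
          },
      },
  ]

  dq_engine = DQEngine(WorkspaceClient())
  df = spark.createDataFrame([("10.1.2.3",), ("not-an-ip",)], "client_ip: string")
  checked_df = dq_engine.apply_checks_by_metadata(df, ip_checks)  # rows gain _errors/_warnings entries
  ```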
* Improved loading checks from Delta table ([#462](https://github.com/databrickslabs/dqx/issues/462)). Loading checks from Delta tables has been improved to eliminate the need to escape string arguments, providing a more robust and user-friendly experience for working with quality checks defined in Delta tables.
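  A hedged sketch of the round trip (the loader method name is assumed from the engine's storage API and may differ; the table names are placeholders):

  ```python
  from databricks.labs.dqx.engine import DQEngine
  from databricks.sdk import WorkspaceClient

  dq_engine = DQEngine(WorkspaceClient())

  # Assumed API: load checks previously saved to a Delta table. With this release,
  # string arguments stored in the table no longer need manual escaping.
  checks = dq_engine.load_checks_from_table("main.dq.checks")

  input_df = spark.read.table("main.src.orders")  # placeholder input table
  checked_df = dq_engine.apply_checks_by_metadata(input_df, checks)
  ```
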
## 0.7.0

* Added end-to-end quality checking methods ([#364](https://github.com/databrickslabs/dqx/issues/364)). The library now includes end-to-end quality checking methods, allowing users to read data from a table or view, apply checks, and write the results to a table. The `DQEngine` class has been updated to utilize `InputConfig` and `OutputConfig` objects to handle input and output configurations, providing more flexibility in the quality checking flow. The `apply_checks_and_write_to_table` and `apply_checks_by_metadata_and_write_to_table` methods have been introduced to support this functionality, applying checks using DQX classes and configuration, respectively. Additionally, the profiler configuration options have been reorganized into `input_config` and `profiler_config` sections, making it easier to understand and customize the profiling process. The changes aim to provide a more streamlined and efficient way to perform end-to-end quality checking and data validation, with improved configuration flexibility and readability.

demos/dqx_demo_library.py

Lines changed: 91 additions & 0 deletions
@@ -1036,6 +1036,97 @@ def apply(df: DataFrame, spark: SparkSession, ref_dfs: dict[str, DataFrame]) ->

# COMMAND ----------

# MAGIC %md
# MAGIC ### Comparison of datasets
# MAGIC
# MAGIC You can use DQX to compare two datasets (DataFrames or tables) to identify differences at the row and column level. This supports use cases such as:
# MAGIC * Migration validation
# MAGIC * Drift detection
# MAGIC * Regression testing between pipeline versions
# MAGIC * Synchronization checks between source and target systems

# COMMAND ----------

from pyspark.sql import functions as F
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx import check_funcs
from databricks.labs.dqx.engine import DQEngine
from databricks.sdk import WorkspaceClient


schema = "id: int, id2: int, val: string"
df = spark.createDataFrame(
    [
        [1, 1, "val1"],
        [4, 5, "val1"],
        [6, 6, "val1"],
        [None, None, None],
    ],
    schema,
)

df2 = spark.createDataFrame(
    [
        [1, 1, "val1"],
        [1, 2, "val2"],
        [5, 5, "val3"],
    ],
    schema,
)

checks = [
    DQDatasetRule(
        criticality="error",
        check_func=check_funcs.compare_datasets,
        columns=["id", "id2"],  # pass df.columns if you don't have a primary key and want to match against all columns
        check_func_kwargs={
            "ref_columns": ["id", "id2"],  # pass df2.columns if you don't have a primary key and want to match against all columns
            "ref_df_name": "ref_df",
            "check_missing_records": True,  # set to True to include information about missing records
        },
    )
]

ref_dfs = {"ref_df": df2}

dq_engine = DQEngine(WorkspaceClient())
output_df = dq_engine.apply_checks(df, checks, ref_dfs=ref_dfs)
display(output_df)

# COMMAND ----------

# MAGIC %md
# MAGIC The detailed log of differences is stored as a JSON string in the message field. It can be parsed into a structured format for easier inspection and analysis.

# COMMAND ----------

message_schema = "struct<row_missing:boolean, row_extra:boolean, changed:map<string,map<string,string>>>"

output_df = output_df.withColumn(
    "_errors",
    F.transform(
        "_errors",
        lambda x: x.withField(
            "diff_log",
            F.from_json(x["message"], message_schema),
        )
    )
).withColumn(
    "_warnings",
    F.transform(
        "_warnings",
        lambda x: x.withField(
            "diff_log",
            F.from_json(x["message"], message_schema),
        )
    )
)

display(output_df)

# COMMAND ----------

# MAGIC %md
# MAGIC ### Applying custom column names and adding user metadata

docs/dqx/docs/demos.mdx

Lines changed: 8 additions & 8 deletions
@@ -8,18 +8,18 @@ import Admonition from '@theme/Admonition';

Import the following notebooks in the Databricks workspace to try DQX out:
## Use as Library
-* [DQX Quick Start Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
-* [DQX Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
-* [DQX Lakeflow Pipelines Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_dlt_demo.py) - demonstrates how to use DQX with Lakeflow Pipelines (formerly Delta Live Tables (DLT)).
-* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
-* [DQX Demo with dbt](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.
+* [DQX Quick Start Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_quick_start_demo_library.py) - quickstart on how to use DQX as a library.
+* [DQX Demo Notebook (library)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_library.py) - demonstrates how to use DQX as a library.
+* [DQX Lakeflow Pipelines Demo Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_dlt_demo.py) - demonstrates how to use DQX with Lakeflow Pipelines (formerly Delta Live Tables (DLT)).
+* [DQX Asset Bundles Demo](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_asset_bundle/README.md) - demonstrates how to use DQX as a library with Databricks Asset Bundles.
+* [DQX Demo with dbt](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_dbt/README.md) - demonstrates how to use DQX as a library with dbt projects.

## Deploy as Workspace Tool
-* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.
+* [DQX Demo Notebook (tool)](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_tool.py) - demonstrates how to use DQX as a tool when installed in the workspace.

## Use Cases
-* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
-* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.0/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.
+* [DQX for PII Detection Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_demo_pii_detection.py) - demonstrates how to use DQX to check data for Personally Identifiable Information (PII).
+* [DQX for Manufacturing Notebook](https://github.com/databrickslabs/dqx/blob/v0.7.1/demos/dqx_manufacturing_demo.py) - demonstrates how to use DQX to check data quality for Manufacturing Industry datasets.

<br />
<Admonition type="tip" title="Execution Environment">

docs/dqx/docs/guide/quality_checks.mdx

Lines changed: 1 addition & 1 deletion
@@ -1095,7 +1095,7 @@ Below is a sample output of a check as stored in a result column:
```

The structure of the result columns is an array of struct containing the following fields
-(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.7.0/src/databricks/labs/dqx/schema/dq_result_schema.py)):
+(see the exact structure [here](https://github.com/databrickslabs/dqx/blob/v0.7.1/src/databricks/labs/dqx/schema/dq_result_schema.py)):
- `name`: name of the check (string type).
- `message`: message describing the quality issue (string type).
- `columns`: name of the column(s) where the quality issue was found (string type).
