Releases: databrickslabs/dqx
v0.10.0
- Added Data Quality Summary Metrics (#553). The data quality engine has been enhanced with the ability to track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new `DQMetricsObserver` class has been introduced to manage Spark observations and track summary metrics on datasets checked with the engine. The `DQEngine` class has been updated to optionally return the Spark observation associated with a given run, allowing users to access and save summary metrics. The engine now also supports writing summary metrics to a table using the `metrics_config` parameter, and a new `save_summary_metrics` method has been added to save data quality summary metrics to a table. Additionally, the engine has been updated to include a unique `run_id` field in the detailed per-row quality results, enabling cross-referencing with summary metrics. The changes also include updates to the configuration file to support the storage of summary metrics. Overall, these enhancements provide a more comprehensive and flexible data quality checking capability, allowing users to track and analyze data quality issues more effectively (see the sketch after this list).
- LLM assisted rules generation (#577). This release introduces a significant enhancement to the data quality rules generation process with the integration of AI-assisted rules generation using large language models (LLMs). The `DQGenerator` class now includes a `generate_dq_rules_ai_assisted` method, which takes user input in natural language and optionally a schema from an input table to generate data quality rules. These rules are then validated for correctness. The AI-assisted rules generation feature supports both programmatic and no-code approaches. Additionally, the feature enables the use of different LLM models and makes it possible to use custom check functions. The release also includes various updates to the documentation, configuration files, and testing framework to support the new AI-assisted rules generation feature, ensuring a more streamlined and efficient process for defining and applying data quality rules.
- Added Lakebase checks storage backend (#550). A Lakebase checks storage backend was added, allowing users to store and manage their data quality rules in a centralized Lakebase table, in addition to the existing Delta table storage. The `checks_location` resolution has been updated to accommodate Lakebase, supporting both table and file storage, with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The Lakebase checks storage backend is configurable through the `LakebaseChecksStorageConfig` class, which includes fields for instance name, user, location, port, run configuration name, and write mode. This update provides users with more flexibility in storing and loading quality checks, ensuring that checks are saved correctly regardless of the specified location format.
- Added runtime validation of SQL expressions (#625). The data quality check functionality has been enhanced with runtime validation of SQL expressions, ensuring that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message. Additionally, the configuration validation for Unity Catalog volume file paths has been improved to enforce a specific format, preventing invalid configurations and providing more informative error messages.
- Fixed docs (#598). The documentation build process has undergone significant improvements to enhance efficiency and maintainability.
- Improved Config Serialization (#676). Several updates have been made to improve the functionality, consistency, and maintainability of the codebase. The configuration loading functionality has been refactored to utilize the `ConfigSerializer` class, which handles the serialization and deserialization of workspace and run configurations.
- Restore use of `hatch-fancy-pypi-readme` to fix images in PyPI (#601). The image source path for the logo in the README has been modified to correctly display the logo image when rendered, particularly on PyPI.
- Skip check evaluation if columns or filter cannot be resolved in the input DataFrame (#609). DQX now skips check evaluation if columns or filters are incorrect, allowing other checks to proceed even if one rule fails. The DQX engine validates the specified `column`, `columns`, and `filter` fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.
- Updated user guide docs (#607). The documentation for quality checking and integration options has been updated to provide accurate and detailed information on supported types and approaches. Quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.
- Improved build process (#618). The hatch version has been updated to 1.15.0 to avoid compatibility issues with click version 8.3 and later, which introduced a bug affecting hatch. Additionally, the project's dependencies have been updated, including bumping the `databricks-labs-pytester` version from 0.7.2 to 0.7.4, and code refactoring has been done to use a single Lakebase instance for all integration tests, with retry logic added to handle cases where the workspace quota limit for the number of Lakebase instances is exceeded, enhancing the testing infrastructure and improving test reliability. Furthermore, documentation updates have been made to clarify the application of quality checks to data using DQX. These changes aim to improve the efficiency, reliability, and clarity of the project's testing and documentation infrastructure.
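For the summary metrics feature (#553), here is a minimal sketch of how the pieces described above could fit together. It is not a verified example of the released API: the `DQMetricsObserver` import path, the `observer` constructor argument, and the `save_summary_metrics` signature are assumptions based on the description above.

```python
from pyspark.sql import SparkSession
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import FileChecksStorageConfig
from databricks.labs.dqx.metrics_observer import DQMetricsObserver  # assumed module path

spark = SparkSession.builder.getOrCreate()

observer = DQMetricsObserver(name="dq_observer")            # tracks summary metrics per run
dq_engine = DQEngine(WorkspaceClient(), observer=observer)  # assumed constructor argument

checks = dq_engine.load_checks(config=FileChecksStorageConfig(location="checks.yml"))
input_df = spark.read.table("main.iot.bronze")
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

# Spark observations are populated only after an action, e.g. writing the results.
valid_df.write.mode("append").saveAsTable("main.iot.silver")
dq_engine.save_summary_metrics()  # assumed call; persists metrics to the metrics_config table
```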
BREAKING CHANGES!
- Added a new `run_id` field to the detailed per-row quality results. This may or may not be a breaking change for you, depending on how you leverage the results today. This is a unique run ID recorded in the summary metrics as well as in the detailed quality checking results to enable cross-referencing. When reusing the same `DQEngine` instance, the run ID stays the same: each apply-checks execution does not generate a new run ID for the same instance. It only changes when a new engine and observer (if using one) are created.
LIMITATIONS
- Saving metrics to a table requires using a classic compute cluster in Dedicated Access Mode. This limitation will be lifted once the observations issue is fixed in Spark Connect.
Contributors: @mwojtyczka, @ghanse, @souravg-db2, @alexott, @tlgnr
v0.9.3
- Added support for running checks on multiple tables (#566). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs, for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.
- Added New Row-level Checks: IPv6 Address Validation (#578). DQX now includes 2 new row-level checks: validation of IPv6 addresses (`is_valid_ipv6_address` check function), and validation that an IPv6 address is within a provided CIDR block (`is_ipv6_address_in_cidr` check function).
- Added New Dataset-level Check: Schema Validation check (#568). The `has_valid_schema` check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a StructType object, and optional arguments to specify columns to validate and strict mode (see the sketch after this list).
- Added New Row-level Checks: Spatial data validations (#581). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values, various geometry and geography types, such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons, as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`. The addition of these geospatial data validation checks enhances the overall data quality capabilities, allowing for more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless or a cluster with runtime 17.1 or above.
- Added absolute and relative tolerance to comparison of datasets (#574). The `compare_datasets` check has been enhanced with the introduction of absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
- Added detailed telemetry (#561). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help us focus development efforts on the areas that matter most to our users.
- Allow installation in a custom folder (#575). The installation process for the library has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on group-assigned clusters.
- Profile subset dataframe (#589). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option, which is a string SQL expression that can be used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules, and it can be used to focus on particular segments of the data, such as rows that match certain conditions.
- Added custom exceptions (#582). The codebase now utilizes custom exceptions to handle various error scenarios, providing more specific and informative error messages compared to generic exceptions.
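As an illustration of the new checks in this release, the sketch below defines them in the metadata (dictionary) format accepted by `DQEngine.apply_checks_by_metadata`. The argument names (`expected_schema`, `strict`, `cidr_block`) are assumptions, not confirmed API.

```python
# Hypothetical metadata definitions for the new v0.9.3 checks; argument names are assumptions.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "has_valid_schema",
            "arguments": {"expected_schema": "id int, ip string", "strict": False},
        },
    },
    {
        "criticality": "warn",
        "check": {"function": "is_valid_ipv6_address", "arguments": {"column": "ip"}},
    },
    {
        "criticality": "warn",
        "check": {
            "function": "is_ipv6_address_in_cidr",
            "arguments": {"column": "ip", "cidr_block": "2001:db8::/32"},
        },
    },
]
```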
BREAKING CHANGES!
- Workflows run by default for all run configs from the configuration file. Previously, the default behaviour was to run them for a specific run config only.
- The following deprecated methods have been removed from the `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_table`, `load_checks_from_installation`, `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_table`, `save_checks_in_installation`, `load_run_config`. For loading and saving checks, users are advised to use the `load_checks` and `save_checks` methods of the `DQEngine`, which support various storage types.
Contributors: @mwojtyczka, @ghanse, @tdikland, @Divya-Kovvuru-0802, @cornzyblack, @STEFANOVIVAS
v0.9.2
- Added performance benchmarks (#548). Performance tests are run to ensure performance does not degrade by more than 25% by any change. Benchmark results are published in the documentation in the reference section. The benchmark covers all check functions, running all functions at once, and applying the same functions at once for multiple columns using for-each column. A new performance GitHub workflow has been introduced to automate performance benchmarking, generating a new benchmark baseline, updating the existing baseline, and running performance tests to compare with the baseline.
- Declare readme in the project (#547). The project configuration has been updated to include the README file in the released package so that it is visible on PyPI.
- Fixed deserializing to DataFrame to assign columns properly (#559). The `deserialize_checks_to_dataframe` function has been enhanced to correctly handle columns for `sql_expression` by removing the unnecessary check for a `DQDatasetRule` instance and directly verifying if `dq_rule_check.columns` is not `None`.
- Fixed lsql dependency (#564). The lsql dependency has been updated to address a sqlglot dependency issue that arises when imported in artifacts repositories.
Contributors: @mwojtyczka @ghanse @cornzyblack @gchandra10
v0.9.1
- Added quality checker and end-to-end workflows (#519). This release introduces a no-code solution for applying checks. The following workflows were added: quality-checker (apply checks and save results to tables) and end-to-end (e2e) workflows (profile input data, generate quality checks, apply the checks, save results to tables). The workflows enable quality checking for data at-rest without the need for code-level integration. They support reference data for checks using tables (e.g., required by foreign key or compare datasets checks) as well as custom Python check functions (mapping of a custom check function to the module path in the workspace or Unity Catalog volume containing the function definition). The workflows handle one run config for each job run; a future release will introduce functionality to execute this across multiple tables. In addition, CLI commands have been added to execute the workflows. Additionally, DQX workflows are now configured to execute using serverless clusters, with an option to use standard clusters as well. `InstallationChecksStorageHandler` now supports absolute workspace path locations.
- Added built-in row-level check for PII detection (#486). Introduced a new built-in check for Personally Identifiable Information (PII) detection, which utilizes the Presidio framework and can be configured using various parameters, such as NLP entity recognition configuration. This check can be defined using the `does_not_contain_pii` check function and can be customized to suit specific use cases. The check requires the `pii` extras to be installed: `pip install databricks-labs-dqx[pii]`. Furthermore, a new enum class `NLPEngineConfig` has been introduced to define various NLP engine configurations for PII detection. Overall, these updates aim to provide more robust and customizable quality checking capabilities for detecting PII data.
- Added equality row-level checks (#535). Two new row-level checks, `is_equal_to` and `is_not_equal_to`, have been introduced to enable equality checks on column values, allowing users to verify whether the values in a specified column are equal to or not equal to a given value, which can be a numeric literal, column expression, string literal, date literal, or timestamp literal (see the sketch after this list).
- Added demo for Spark Structured Streaming (#518). Added a demo to showcase usage of DQX with Spark Structured Streaming for in-transit data quality checking. The demo is available as a Databricks notebook and can be run on any Databricks workspace.
- Added clarification to profiler summary statistics (#523). Added new section on understanding summary statistics, which explains how these statistics are computed on a sampled subset of the data and provides a reference for the various summary statistics fields.
- Fixed rounding datetimes in the checks generator (#517). The generator has been enhanced to correctly handle midnight values when rounding "up", ensuring that datetime values already at midnight remain unchanged, whereas previously they were rounded to the next day.
- Added API Docs (#520). The DQX API documentation is generated automatically using docstrings. As part of this change the library's documentation has been updated to follow Google style.
- Improved test automation by adding end-to-end test for the asset bundles demo (#533).
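A sketch of the new equality checks (#535) in metadata form; the `value` argument name is an assumption, not confirmed API.

```python
# Hypothetical metadata for the new equality checks; the `value` argument name is an assumption.
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_equal_to", "arguments": {"column": "status", "value": "active"}},
    },
    {
        "criticality": "warn",
        "check": {"function": "is_not_equal_to", "arguments": {"column": "quantity", "value": 0}},
    },
]
```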
BREAKING CHANGES!
`ExtraParams` was moved from the `databricks.labs.dqx.rule` module to `databricks.labs.dqx.config`.
Contributors: @mwojtyczka @ghanse @renardeinside @cornzyblack @bsr-the-mngrm @dinbab1984 @AdityaMandiwal
v0.8.0
- Added new row-level freshness check (#495). A new data quality check function, `is_data_fresh`, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes from a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, defaulting to the current timestamp if not provided.
- Added new dataset-level freshness check (#499). A new dataset-level check function, `is_data_fresh_per_time_window`, has been added to validate whether at least a specified minimum number of records arrive within every specified time window, ensuring data freshness. This function is customizable, allowing users to define the time window, minimum records per window, and lookback period (see the sketch after this list).
- Improvements have been made to the performance of aggregation check functions, and the check message format has been updated for better readability.
- Created LLM util function to get check functions details (#469). A new utility function has been introduced to provide definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
- Added equality safe row and column matching in compare datasets check (#473). The compare datasets check functionality has been enhanced to handle null values during row matching and column value comparisons, improving its robustness and flexibility. Two new optional parameters, `null_safe_row_matching` and `null_safe_column_value_matching`, have been introduced to control how null values are handled, both defaulting to True. These parameters allow for null-safe primary key matching and column value matching, ensuring accurate comparison results even when null values are present in the data. The check now excludes specific columns from value comparison using the `exclude_columns` parameter while still considering them for row matching.
- Fixed datetime rounding logic in profiler (#483). The datetime rounding logic has been improved in the profiler to respect the `round=False` option, which was previously ignored. The code now handles the `OverflowError` that occurs when rounding up the maximum datetime value by capping the result and logging a warning.
- Added loading and saving checks from file in Unity Catalog Volume (#512). This change introduces support for storing quality checks in a Unity Catalog Volume, in addition to existing storage types such as tables, files, and workspace files. The storage location of quality checks has been unified into a single configuration field called `checks_location`, replacing the previous `checks_file` and `checks_table` fields, to simplify the configuration and remove ambiguity by ensuring only one storage location can be defined per run configuration. The `checks_location` field can point to a file in the local path, workspace, installation folder, or Unity Catalog Volume, providing users with more flexibility and clarity when managing their quality checks.
- Refactored methods for loading and saving checks (#487). The `DQEngine` class has undergone significant changes to improve modularity and maintainability, including the unification of methods for loading and saving checks under the `load_checks` and `save_checks` methods, which take a `config` parameter to determine the storage type, such as `FileChecksStorageConfig`, `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, or `InstallationChecksStorageConfig`.
- Storing checks using DQX classes (#474). The data quality engine has been enhanced with methods to convert quality checks between `DQRule` objects and Python dictionaries, allowing for flexibility in check definition and usage. The `serialize_checks` method converts a list of `DQRule` instances into a dictionary representation, while the `deserialize_checks` method performs the reverse operation, converting a dictionary representation back into a list of `DQRule` instances. Additionally, the `DQRule` class now includes a `to_dict` method to convert a `DQRule` instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes enable users to work with checks in both formats, store and retrieve checks easily, and improve the overall management and storage of data quality checks. The conversion process supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format.
- Added LLM utility function to extract checks examples in YAML from docs (#506). This is achieved through a new Python script that extracts YAML examples from MDX documentation files and creates a combined YAML file with all the extracted examples. The script utilizes regular expressions to extract YAML code blocks from MDX content, validates each YAML block, and combines all valid blocks into a single list. The combined YAML file is then created in the LLM resources directory for use in language model processing.
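A sketch of the new freshness checks (#495, #499) in metadata form; the argument names (`max_age_minutes`, `window_minutes`, `min_records_per_window`) are assumptions based on the parameter descriptions above.

```python
# Hypothetical metadata for the freshness checks; argument names are assumptions.
checks = [
    {
        "criticality": "warn",
        "check": {
            "function": "is_data_fresh",
            "arguments": {"column": "event_ts", "max_age_minutes": 60},
        },
    },
    {
        "criticality": "error",
        "check": {
            "function": "is_data_fresh_per_time_window",
            "arguments": {
                "column": "event_ts",
                "window_minutes": 15,
                "min_records_per_window": 100,
            },
        },
    },
]
```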
BREAKING CHANGES!
- The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.
- The `load_run_config` method has been moved to `config_loader.RunConfigLoader`, as it is not intended for direct use and falls outside the `DQEngine` core responsibilities.
DEPRECATION CHANGES!
If you are loading or saving checks from storage (file, workspace file, table, installation), you are affected. We are deprecating the methods below. We are keeping them in the `DQEngine` for now, but you should update your code as these methods will be removed in future versions.
- Loading checks from storage has been unified under the `load_checks` method. The following `DQEngine` methods have been deprecated: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_installation`, `load_checks_from_table`.
- Saving checks in storage has been unified under the `save_checks` method. The following `DQEngine` methods have been deprecated: `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_installation`, `save_checks_in_table`.
The `save_checks` and `load_checks` methods take a config as a parameter, which determines the storage type used. The following storage configs are currently supported:
- `FileChecksStorageConfig`: file in the local filesystem (YAML or JSON)
- `WorkspaceFileChecksStorageConfig`: file in the workspace (YAML or JSON)
- `TableChecksStorageConfig`: a table
- `InstallationChecksStorageConfig`: storage defined in the installation context, using either the `checks_table` or `checks_file` field from the run configuration
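A minimal sketch of the unified loading and saving flow, assuming the storage config classes live in `databricks.labs.dqx.config` and accept a `location` argument:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import FileChecksStorageConfig, TableChecksStorageConfig

dq_engine = DQEngine(WorkspaceClient())

# Load checks from a local YAML file...
checks = dq_engine.load_checks(config=FileChecksStorageConfig(location="checks.yml"))

# ...and save them to a Delta table for centralized management.
dq_engine.save_checks(checks, config=TableChecksStorageConfig(location="main.dq.checks"))
```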
Contributors: @mwojtyczka, @karthik-ballullaya-db, @bsr-the-mngrm, @ajinkya441, @cornzyblack, @ghanse, @jominjohny, @dinbab1984
v0.7.1
- Added type validation for apply checks method (#465). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of `DQRule`. If invalid types are encountered, a `TypeError` is raised with a descriptive error message, suggesting alternative methods for passing checks as dictionaries. Additionally, input attribute validation has been enhanced to verify the criticality value, which must be either `warn` or `error`, and raises a `ValueError` for invalid values.
- Databricks Asset Bundle (DAB) demo (#443). A new demo showcasing the usage of DQX with DAB has been added.
- Check to compare datasets (#463). A new dataset-level check, `compare_datasets`, has been introduced to compare two DataFrames at both row and column levels, providing detailed information about differences, including new or missing rows and column-level changes. This check compares only columns present in both DataFrames, excludes map type columns, and can be customized to exclude specific columns or perform a FULL OUTER JOIN to identify missing records. The `compare_datasets` check can be used with a reference DataFrame or table name, and its results include information about missing and extra rows, as well as a map of changed columns and their differences (see the sketch after this list).
- Demo on how to use DQX with dbt projects (#460). A new demo has been added to showcase how to use DQX with the dbt transformation framework.
- IP V4 address validation (#464). The library has been enhanced with new checks to validate IPv4 addresses. Two new row checks, `is_valid_ipv4_address` and `is_ipv4_address_in_cidr`, have been introduced to verify whether values in a specified column are valid IPv4 addresses and whether they fall within a given CIDR block, respectively.
- Improved loading checks from Delta table (#462). Loading checks from Delta tables has been improved to eliminate the need to escape string arguments, providing a more robust and user-friendly experience for working with quality checks defined in Delta tables.
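A sketch of the `compare_datasets` check (#463) in metadata form; all of the argument names shown (`columns`, `ref_table`, `exclude_columns`) are assumptions based on the description above, not confirmed API.

```python
# Hypothetical metadata for the compare_datasets check; argument names are assumptions.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "compare_datasets",
            "arguments": {
                "columns": ["order_id"],                 # columns used to match rows
                "ref_table": "main.dq.orders_expected",  # reference table to compare against
                "exclude_columns": ["load_ts"],          # excluded from value comparison
            },
        },
    },
]
```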
Contributors: @mwojtyczka, @cornzyblack, @ghanse, @grusin-db
v0.7.0
- Added end-to-end quality checking methods (#364). The library now includes end-to-end quality checking methods, allowing users to read data from a table or view, apply checks, and write the results to a table. The `DQEngine` class has been updated to utilize `InputConfig` and `OutputConfig` objects to handle input and output configurations, providing more flexibility in the quality checking flow. The `apply_checks_and_write_to_table` and `apply_checks_by_metadata_and_write_to_table` methods have been introduced to support this functionality, applying checks using DQX classes and configuration, respectively. Additionally, the profiler configuration options have been reorganized into `input_config` and `profiler_config` sections, making it easier to understand and customize the profiling process. The changes aim to provide a more streamlined and efficient way to perform end-to-end quality checking and data validation, with improved configuration flexibility and readability (see the sketch after this list).
- Added equality checks for aggregate values and negate option for Foreign Key (#387). The library now includes two new checks, `is_aggr_equal` and `is_aggr_not_equal`, which enable users to perform equality checks on aggregate values, such as count, sum, average, minimum, and maximum, allowing verification that an aggregation on a column or group of columns is equal to or not equal to a specified limit. These checks can be configured with a criticality level of either `error` or `warn` and can be applied to specific columns or groups of columns. Additionally, the `foreign_key` check has been updated with a `negate` option, allowing the condition to be negated so that the check fails when the foreign key values exist in the reference dataframe or table, rather than when they do not exist. This expanded functionality enhances the library's data quality checking capabilities, providing more flexibility and power in validating data integrity.
- Extend options for profiling multiple tables (#420). The profiler now supports wildcard patterns for profiling multiple tables, replacing the previous regex pattern support, and options can be passed as a list of dictionaries to apply different options to each table based on pattern matching. The profiler job is now set up with an IO-cache-enabled cluster.
- Improved quick demo to showcase defining checks using DQX classes and renamed DLT to Lakeflow Pipeline in docs (#399). `DLT` has been renamed to `Lakeflow Pipeline` in documentation and docstrings to maintain consistency in terminology. The quick demo has been enhanced to showcase defining checks using DQX classes, providing a more comprehensive approach to data quality validation. Additionally, performance information related to dataset-level checks has been added to the documentation, and instructions on how to use the Environment to install DQX in Lakeflow Pipelines have been provided.
- Populate columns in the results from kwargs of the check if provided (#416). Additionally, the `sql_expression` check now supports an optional `columns` argument that is propagated to the results.
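A minimal end-to-end sketch for #364, assuming `InputConfig`/`OutputConfig` live in `databricks.labs.dqx.config` with `location`, `format`, and `mode` fields, and that the method accepts `checks`, `input_config`, and `output_config` arguments:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import InputConfig, OutputConfig

dq_engine = DQEngine(WorkspaceClient())

# Checks defined in the metadata (dictionary) format.
checks = [
    {"criticality": "error", "check": {"function": "is_not_null", "arguments": {"column": "device_id"}}},
]

# Read from the input table, apply the checks, and write the annotated results.
dq_engine.apply_checks_by_metadata_and_write_to_table(
    checks=checks,
    input_config=InputConfig(location="main.iot.bronze", format="delta"),
    output_config=OutputConfig(location="main.iot.silver", mode="append"),
)
```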
BREAKING CHANGES!
- Only users using `config.yml` are affected. The breaking change is for the required field `output_location`, which is now stored inside `output_config.location`.
- The structure of the config.yml has been improved to include configs for input, output, and quarantine data instead of specifying the fields directly. New fields related to the profiler cluster and streaming have also been included. The installer has been updated accordingly.
Full example:
log_level: INFO
version: 1
profiler_override_clusters: # <- optional dictionary mapping job cluster names to existing cluster IDs
  main: your-existing-cluster-id # <- existing cluster ID to use
profiler_spark_conf: # <- optional spark configuration to use for the profiler job
  spark.sql.ansi.enabled: true
run_configs:
- name: default # <- unique name of the run config (default used during installation)
  input_config: # <- optional input data configuration
    location: s3://iot-ingest/raw # <- input location of the data (table or cloud path)
    format: delta # <- format, required if cloud path provided
    is_streaming: false # <- whether the input data should be read using streaming (default is false)
    schema: col1 int, col2 string # <- schema of the input data (optional), applicable if reading csv and json files
    options: # <- additional options for reading from the input location (optional)
      versionAsOf: '0'
  output_config: # <- output data configuration
    location: main.iot.silver # <- output location (table), used as input for the quality dashboard if quarantine location is not provided
    format: delta # <- format of the output table
    mode: append # <- write mode for the output table (append or overwrite)
    options: # <- additional options for writing to the output table (optional)
      mergeSchema: 'true'
      #checkpointLocation: /Volumes/catalog1/schema1/checkpoint # <- only applicable if input_config.is_streaming is enabled
    trigger: # <- streaming trigger, only applicable if input_config.is_streaming is enabled
      availableNow: true
  quarantine_config: # <- quarantine data configuration, if specified, bad data is written to quarantine table
    location: main.iot.silver_quarantine # <- quarantine location (table), used as input for the quality dashboard
    format: delta # <- format of the quarantine table
    mode: append # <- write mode for the quarantine table (append or overwrite)
    options: # <- additional options for writing to the quarantine table (optional)
      mergeSchema: 'true'
      #checkpointLocation: /Volumes/catalog1/schema1/checkpoint # <- only applicable if input_config.is_streaming is enabled
    trigger: # <- streaming trigger, only applicable if input_config.is_streaming is enabled
      availableNow: true
  checks_file: iot_checks.yml # <- relative location of the quality rules (checks) defined in json or yaml file
  checks_table: main.iot.checks # <- table storing the quality rules (checks)
  profiler_config: # <- profiler configuration
    summary_stats_file: iot_summary_stats.yml # <- relative location of profiling summary stats
    sample_fraction: 0.3 # <- fraction of data to sample in the profiler (30%)
    sample_seed: 30 # <- optional seed for reproducible sampling
    limit: 1000 # <- limit the number of records to profile
  warehouse_id: your-warehouse-id # <- warehouse id for refreshing dashboard
- name: another_run_config # <- unique name of the run config
  ...
Contributors: @mwojtyczka, @ghanse
v0.6.0
Release v0.6.0 (#395)
- Added Dataset-level checks, Foreign Key and SQL Script checks (#375). The data quality library has been enhanced with the introduction of dataset-level checks, which allow users to apply quality checks at the dataset level, in addition to existing row-level checks. Similar to row-level checks, the results of the dataset-level quality checks are reported for each individual row in the result columns. A new `DQDatasetRule` class has been added to define dataset-level checks, and several new check functions have been added, including the `foreign_key` and `sql_query` dataset-level checks. The library now also supports custom dataset-level checks using arbitrary SQL queries and provides the ability to define checks on multiple DataFrames or Tables. The `DQEngine` class has been modified to optionally accept a Spark session as a parameter in its constructor, allowing users to pass their own Spark session. Major internal refactoring has been carried out to improve code maintenance and structure.
- Added Demo from Data and AI Summit 2025 - DQX Demo for Manufacturing Industry (#391).
- Added Github Action to check if all commits are signed (#392). The library now includes a GitHub Action that automates the verification of signed commits in pull requests, enhancing the security and integrity of the codebase. This action checks each commit in a pull request to ensure it is signed using the `git commit -S` command, and if any unsigned commits are found, it posts a comment with instructions on how to properly sign commits.
- Added methods to profile multiple tables (#374). The data profiling feature has been significantly enhanced with the introduction of two new methods, `profile_table` and `profile_tables`, which enable direct profiling of Delta tables, allowing users to generate summary statistics and candidate data quality rules. These methods provide a convenient way to profile data stored in Delta tables, with `profile_table` generating a profile from a single Delta table and `profile_tables` generating profiles from multiple Delta tables using explicit table lists or regex patterns for inclusion and exclusion. The profiling process is highly customizable, supporting extensive configuration options such as sampling, outlier detection, null value handling, and string handling. The generated profiles can be used to create Delta Live Tables expectations for enforcing data quality rules, and the profiling results can be stored in a table or file as YAML or JSON for easy management and reuse (see the sketch after this list).
- Created a quick start demo (#367). A new demo notebook has been introduced to provide a quickstart guide for utilizing the library, enabling users to easily test features using the Databricks Power Tools and Databricks Extension in VS Code. This demo notebook showcases both configuration styles side by side, applying the same rules for direct comparison, and includes a small, hardcoded sample dataset for quick experimentation, designed to be executed cell-by-cell in VS Code. The addition of this demo aims to help new users understand and compare both configuration approaches in a practical context, facilitating a smoother onboarding experience.
- Pin GitHub URLs in docs to the latest released version (#390). The formatting process now includes an additional step to update GitHub URLs, ensuring they point to the latest released version instead of the main branch, which helps prevent access to unreleased changes. This update is automated during the release process and allows users to review changes before committing.
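A minimal profiling sketch for #374; the `DQProfiler`/`DQGenerator` import paths and the exact `profile_table` signature are assumptions based on the description above.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler    # assumed module path
from databricks.labs.dqx.profiler.generator import DQGenerator  # assumed module path

ws = WorkspaceClient()

# Profile a single Delta table; returns summary statistics and column profiles.
summary_stats, profiles = DQProfiler(ws).profile_table("main.iot.bronze")

# Turn the profiles into candidate data quality rules.
checks = DQGenerator(ws).generate_dq_rules(profiles)
```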
BREAKING CHANGES!
- Moved existing `is_unique`, `is_aggr_not_greater_than`, and `is_aggr_not_less_than` checks under the dataset-level checks umbrella. These checks must now be defined using the `DQDatasetRule` class and not `DQRowRule`. Input parameters remain the same as before. This is a breaking change for checks defined using DQX classes; YAML/JSON definitions are not affected.
- `DQRowRuleForEachCol` has been renamed to `DQForEachColRule` to make it generic and able to handle both row- and dataset-level rules.
- Renamed `column_names` to `result_column_names` in the `ExtraParams` for clarity, as they may be confused with the column(s) specified for the rules themselves. This is a breaking change!
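A sketch of the migration for class-based checks; the module paths (`databricks.labs.dqx.rule`, `databricks.labs.dqx.check_funcs`) and the exact `DQDatasetRule` fields are assumptions.

```python
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx import check_funcs

# is_unique was previously defined via DQRowRule; it must now use DQDatasetRule.
uniqueness_check = DQDatasetRule(
    criticality="error",
    check_func=check_funcs.is_unique,
    columns=["order_id", "order_line"],  # composite key
)
```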
Contributors: @mwojtyczka @alexott @nehamilak-db @ghanse @Escoto
v0.5.0
What's Changed
- Fix spark remote version detection in CI by @mwojtyczka in #342
- Fix spark remote installation by @mwojtyczka in #346
- Load and save checks from a Delta table by @ghanse in #339
- Handle nulls in uniqueness check for composite keys to conform with the SQL ANSI Standard by @mwojtyczka in #345
Corrected the `is_unique` check to consistently handle nulls. To conform with the SQL ANSI Standard, null values are now treated as unknown, and thus not duplicates, e.g. (NULL, NULL) does not equal (NULL, NULL) and (1, NULL) does not equal (1, NULL). A new parameter called `nulls_distinct` has been added to the check, which is set to `True` by default. If set to `False`, null values are treated as duplicates, e.g. (1, NULL) equals (1, NULL) and (NULL, NULL) equals (NULL, NULL).
Added a `columns` parameter to the `is_unique` check to be able to pass a list of columns to handle composite keys (a sketch follows this list).
- Allow user metadata for individual checks by @ghanse in #352
- Add save_results_in_table function by @karthik-ballullaya-db in #319
- Fix older than by @ghanse in #354
- Add PII-detection example by @ghanse in #358
- Add aggregation type of checks (count/min/max/sum/avg/mean over a group of records) by @mwojtyczka in #357
- Updated documentation to provide guidance on applying quality checks to multiple data sets.
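A sketch of the updated `is_unique` check in metadata form, combining the new `columns` and `nulls_distinct` parameters described above; the `check`/`function`/`arguments` layout follows the YAML examples later in this section, and the `criticality` field is the standard DQX criticality setting.

```python
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_unique",
            "arguments": {
                "columns": ["order_id", "order_line"],  # composite key
                "nulls_distinct": False,  # treat (1, NULL) and (1, NULL) as duplicates
            },
        },
    },
]
```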
BREAKING CHANGES AND HOW TO UPGRADE!
This release introduces changes to the syntax of quality rules to ensure we can scale DQX functionality in the future. Please update your quality rules accordingly.
- Renamed the `col_name` field in the checks to `column` to better reflect the fact that it can be a column name or column expression.
From:
- check:
function: is_not_null
arguments:
col_name: col1
To:
- check:
function: is_not_null
arguments:
column: col1
- Similarly, custom checks must define `column` instead of `col_name` as the first argument.
- Renamed and changed the data type of `col_name` to `columns` in the reporting columns. DQX now stores a list of columns instead of a single column to handle checks that take multiple columns as input.
- Renamed `col_names` to `for_each_column` and moved it from the check function arguments to the check level.
From:
- check:
function: is_not_null
arguments:
col_names:
- col1
- col2
To:
- check:
function: is_not_null
for_each_column:
- col1
- col2
- Renamed `DQColRule` to `DQRowRule` for clarity. This only affects checks defined using classes.
- Renamed `DQColSetRule` to `DQRowRuleForEachCol` for clarity. This only affects checks defined using classes.
- Renamed the `row_checks` module containing check functions to `check_funcs`. This only affects checks defined using classes.
The documentation has been updated accordingly: https://databrickslabs.github.io/dqx/docs/reference/quality_rules/
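A sketch of updating class-based checks to the new names; the import paths and the `DQRowRule` fields shown are assumptions.

```python
from databricks.labs.dqx.rule import DQRowRule  # formerly DQColRule
from databricks.labs.dqx import check_funcs     # module formerly named row_checks

rule = DQRowRule(
    criticality="warn",
    check_func=check_funcs.is_not_null,
    column="col1",  # formerly col_name
)
```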
Full Changelog: v0.4.0...v0.5.0
v0.4.0
- Added input spark options and schema for reading from the storage (#312). This commit enhances the data quality framework used for profiling and validating data in a Databricks workspace with new options and functionality for reading data from storage. It allows for the usage of input Spark options and schema, and supports fully qualified Unity Catalog or Hive Metastore table names in the format of `catalog.schema.table` or `schema.table`. Additionally, the code now adds a new dataclass field, `input_schema`, and a new dictionary field, `input_read_options`, to the `RunConfig` class. The documentation is updated with examples of how to use the new functionality.
- Added an example of uniqueness check for composite key (#312).
- Renamed row checks module for more clarity (#314). This change renames the `col_check_functions` module to `row_checks` for clarity and to distinguish it from other types of checks. The `import *` syntax is removed and unused imports are removed from the demo. This change requires updating import statements that reference `col_check_functions` to use the new name `row_checks`. Checks defined using DQX classes require a simple update (see the sketch below).
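The import update implied by the rename, as a sketch (the exact module paths are assumptions):

```python
# Before (<= 0.3.1):
# from databricks.labs.dqx.col_check_functions import is_not_null
# After (0.4.0):
from databricks.labs.dqx.row_checks import is_not_null
```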
Full Changelog: v0.3.1...v0.4.0