Releases: databrickslabs/dqx
v0.10.0
- Added Data Quality Summary Metrics (#553). The data quality engine has been enhanced with the ability to track and manage summary metrics for data quality validation, leveraging Spark's Observation feature. A new `DQMetricsObserver` class has been introduced to manage Spark observations and track summary metrics on datasets checked with the engine. The `DQEngine` class has been updated to optionally return the Spark observation associated with a given run, allowing users to access and save summary metrics. The engine now also supports writing summary metrics to a table using the `metrics_config` parameter, and a new `save_summary_metrics` method has been added to save data quality summary metrics to a table. Additionally, the engine has been updated to include a unique `run_id` field in the detailed per-row quality results, enabling cross-referencing with summary metrics. The changes also include updates to the configuration file to support the storage of summary metrics. Overall, these enhancements provide a more comprehensive and flexible data quality checking capability, allowing users to track and analyze data quality issues more effectively (see the sketch after this list).
- LLM assisted rules generation (#577). This release introduces a significant enhancement to the data quality rules generation process with the integration of AI-assisted rules generation using large language models (LLMs). The `DQGenerator` class now includes a `generate_dq_rules_ai_assisted` method, which takes user input in natural language and optionally a schema from an input table to generate data quality rules. These rules are then validated for correctness. The AI-assisted rules generation feature supports both programmatic and no-code approaches. Additionally, the feature enables the use of different LLM models and makes it possible to use custom check functions. The release also includes various updates to the documentation, configuration files, and testing framework to support the new AI-assisted rules generation feature, ensuring a more streamlined and efficient process for defining and applying data quality rules.
- Added Lakebase checks storage backend (#550). A Lakebase checks storage backend was added, allowing users to store and manage their data quality rules in a centralized Lakebase table, in addition to the existing Delta table storage. The `checks_location` resolution has been updated to accommodate Lakebase, supporting both table and file storage, with flexible formatting options, including "catalog.schema.table" and "database.schema.table". The Lakebase checks storage backend is configurable through the `LakebaseChecksStorageConfig` class, which includes fields for instance name, user, location, port, run configuration name, and write mode. This update provides users with more flexibility in storing and loading quality checks, ensuring that checks are saved correctly regardless of the specified location format.
- Added runtime validation of SQL expressions (#625). The data quality check functionality has been enhanced with runtime validation of SQL expressions, ensuring that specified fields can be resolved in the input DataFrame and that SQL expressions are valid before evaluation. If an SQL expression is invalid, the check evaluation is skipped and the results include a check failure with a descriptive message. Additionally, the configuration validation for Unity Catalog volume file paths has been improved to enforce a specific format, preventing invalid configurations and providing more informative error messages.
- Fixed docs (#598). The documentation build process has undergone significant improvements to enhance efficiency and maintainability.
- Improved Config Serialization (#676). Several updates have been made to improve the functionality, consistency, and maintainability of the codebase. The configuration loading functionality has been refactored to utilize the `ConfigSerializer` class, which handles the serialization and deserialization of workspace and run configurations.
- Restore use of `hatch-fancy-pypi-readme` to fix images in PyPI (#601). The image source path for the logo in the README has been modified to correctly display the logo image when rendered, particularly on PyPI.
- Skip check evaluation if columns or filter cannot be resolved in the input DataFrame (#609). DQX now skips check evaluation if columns or filters are incorrect, allowing other checks to proceed even if one rule fails. The DQX engine validates the specified `column`, `columns`, and `filter` fields against the input DataFrame before applying checks, skipping evaluation and providing informative error messages if any fields are invalid.
- Updated user guide docs (#607). The documentation for quality checking and integration options has been updated to provide accurate and detailed information on supported types and approaches. Quality checking can be performed in-transit (pre-commit), validating data on the fly during processing, or at-rest, checking existing data stored in tables.
- Improved build process (#618). The hatch version has been updated to 1.15.0 to avoid compatibility issues with click version 8.3 and later, which introduced a bug affecting hatch. Additionally, the project's dependencies have been updated, including bumping the `databricks-labs-pytester` version from 0.7.2 to 0.7.4, and code refactoring has been done to use a single Lakebase instance for all integration tests, with retry logic added to handle cases where the workspace quota limit for the number of Lakebase instances is exceeded, enhancing the testing infrastructure and improving test reliability. Furthermore, documentation updates have been made to clarify the application of quality checks to data using DQX. These changes aim to improve the efficiency, reliability, and clarity of the project's testing and documentation infrastructure.
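For the summary metrics feature (#553), here is a minimal sketch of how the pieces described above could fit together. It is not a verified example of the released API: the `DQMetricsObserver` import path, the `observer` constructor argument, and the `save_summary_metrics` signature are assumptions based on the description above.

```python
from pyspark.sql import SparkSession
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import FileChecksStorageConfig
from databricks.labs.dqx.metrics_observer import DQMetricsObserver  # assumed module path

spark = SparkSession.builder.getOrCreate()

observer = DQMetricsObserver(name="dq_observer")            # tracks summary metrics per run
dq_engine = DQEngine(WorkspaceClient(), observer=observer)  # assumed constructor argument

checks = dq_engine.load_checks(config=FileChecksStorageConfig(location="checks.yml"))
input_df = spark.read.table("main.iot.bronze")
valid_df, quarantine_df = dq_engine.apply_checks_by_metadata_and_split(input_df, checks)

# Spark observations are populated only after an action, e.g. writing the results.
valid_df.write.mode("append").saveAsTable("main.iot.silver")
dq_engine.save_summary_metrics()  # assumed call; persists metrics to the metrics_config table
```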
BREAKING CHANGES!
- Added a new `run_id` field to the detailed per-row quality results. This may or may not be a breaking change for you, depending on how you leverage the results today. This is a unique run ID recorded in the summary metrics as well as in the detailed quality checking results to enable cross-referencing. When reusing the same `DQEngine` instance, the run ID stays the same: each apply-checks execution does not generate a new run ID for the same instance. It only changes when a new engine and observer (if using one) are created.
LIMITATIONS
- Saving metrics to a table requires using a classic compute cluster in Dedicated Access Mode. This limitation will be lifted once the observations issue is fixed in Spark Connect.
Contributors: @mwojtyczka, @ghanse, @souravg-db2, @alexott, @tlgnr
v0.9.3
- Added support for running checks on multiple tables (#566). Added more flexibility and functionality in running data quality checks, allowing users to run checks on multiple tables in a single method call and as part of Workflows execution. Provided options to run checks for all configured run configs, for a specific run config, or for tables/views matching wildcard patterns. The CLI commands for running workflows have been updated to reflect and support these new functionalities. Additionally, new parameters have been added to the configuration file to control the level of parallelism for these operations, such as `profiler_max_parallelism` and `quality_checker_max_parallelism`. A new demo has been added to showcase how to use the profiler and apply checks across multiple tables. The changes aim to improve the scalability of DQX.
- Added New Row-level Checks: IPv6 Address Validation (#578). DQX now includes 2 new row-level checks: validation of IPv6 addresses (`is_valid_ipv6_address` check function), and validation that an IPv6 address is within a provided CIDR block (`is_ipv6_address_in_cidr` check function).
- Added New Dataset-level Check: Schema Validation check (#568). The `has_valid_schema` check function has been introduced to validate whether a DataFrame conforms to a specified schema, with results reported at the row level for consistency with other checks. This function can operate in non-strict mode, where it verifies the existence of expected columns with compatible types, or in strict mode, where it enforces an exact schema match, including column order and types. It accepts parameters such as the expected schema, which can be defined as a DDL string or a StructType object, and optional arguments to specify columns to validate and strict mode (see the sketch after this list).
- Added New Row-level Checks: Spatial data validations (#581). Specialized data validation checks for geospatial data have been introduced, enabling verification of valid latitude and longitude values, various geometry and geography types, such as points, linestrings, polygons, multipoints, multilinestrings, and multipolygons, as well as checks for Open Geospatial Consortium (OGC) validity, non-empty geometries, and specific dimensions or coordinate ranges. These checks are implemented as check functions, including `is_latitude`, `is_longitude`, `is_geometry`, `is_geography`, `is_point`, `is_linestring`, `is_polygon`, `is_multipoint`, `is_multilinestring`, `is_multipolygon`, `is_ogc_valid`, `is_non_empty_geometry`, `has_dimension`, `has_x_coordinate_between`, and `has_y_coordinate_between`. The addition of these geospatial data validation checks enhances the overall data quality capabilities, allowing for more accurate and reliable geospatial data processing and analysis. Running these checks requires Databricks serverless or a cluster with runtime 17.1 or above.
- Added absolute and relative tolerance to comparison of datasets (#574). The `compare_datasets` check has been enhanced with the introduction of absolute and relative tolerance parameters, enabling more flexible comparisons of decimal values. These tolerances can be applied to numeric columns.
- Added detailed telemetry (#561). Telemetry has been enhanced across multiple functionalities to provide better visibility into DQX usage, including which features and checks are used most frequently. This will help us focus development efforts on the areas that matter most to our users.
- Allow installation in a custom folder (#575). The installation process for the library has been enhanced to offer flexible installation options, allowing users to install the library in a custom workspace folder, in addition to the default user home directory or a global folder. When installing DQX as a workspace tool using the Databricks CLI, users are prompted to optionally specify a custom workspace path for the installation. Allowing a custom installation folder makes it possible to use DQX on group-assigned clusters.
- Profile subset dataframe (#589). The data profiling feature has been enhanced to allow users to profile and generate rules on a subset of the input data by introducing a filter option, which is a string SQL expression that can be used to filter the input data. This filter can be specified in the configuration file or when using the profiler, providing more flexibility in analyzing subsets of data. The profiler supports extensive configuration options to customize the profiling process, including sampling, limiting, and computing statistics on the sampled data. The new filter option enables users to generate more targeted and relevant rules, and it can be used to focus on particular segments of the data, such as rows that match certain conditions.
- Added custom exceptions (#582). The codebase now utilizes custom exceptions to handle various error scenarios, providing more specific and informative error messages compared to generic exceptions.
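As an illustration of the new checks in this release, the sketch below defines them in the metadata (dictionary) format accepted by `DQEngine.apply_checks_by_metadata`. The argument names (`expected_schema`, `strict`, `cidr_block`) are assumptions, not confirmed API.

```python
# Hypothetical metadata definitions for the new v0.9.3 checks; argument names are assumptions.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "has_valid_schema",
            "arguments": {"expected_schema": "id int, ip string", "strict": False},
        },
    },
    {
        "criticality": "warn",
        "check": {"function": "is_valid_ipv6_address", "arguments": {"column": "ip"}},
    },
    {
        "criticality": "warn",
        "check": {
            "function": "is_ipv6_address_in_cidr",
            "arguments": {"column": "ip", "cidr_block": "2001:db8::/32"},
        },
    },
]
```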
BREAKING CHANGES!
- Workflows run by default for all run configs from the configuration file. Previously, the default behaviour was to run them for a specific run config only.
- The following deprecated methods have been removed from the `DQEngine`: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_table`, `load_checks_from_installation`, `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_table`, `save_checks_in_installation`, `load_run_config`. For loading and saving checks, users are advised to use the `load_checks` and `save_checks` methods of the `DQEngine`, which support various storage types.
Contributors: @mwojtyczka, @ghanse, @tdikland, @Divya-Kovvuru-0802, @cornzyblack, @STEFANOVIVAS
v0.9.2
- Added performance benchmarks (#548). Performance tests are run to ensure performance does not degrade by more than 25% by any change. Benchmark results are published in the documentation in the reference section. The benchmark covers all check functions, running all functions at once, and applying the same functions at once for multiple columns using for-each column. A new performance GitHub workflow has been introduced to automate performance benchmarking, generating a new benchmark baseline, updating the existing baseline, and running performance tests to compare with the baseline.
- Declare readme in the project (#547). The project configuration has been updated to include the README file in the released package so that it is visible on PyPI.
- Fixed deserializing to DataFrame to assign columns properly (#559). The `deserialize_checks_to_dataframe` function has been enhanced to correctly handle columns for `sql_expression` by removing the unnecessary check for a `DQDatasetRule` instance and directly verifying if `dq_rule_check.columns` is not `None`.
- Fixed lsql dependency (#564). The lsql dependency has been updated to address a sqlglot dependency issue that arises when imported in artifacts repositories.
Contributors: @mwojtyczka @ghanse @cornzyblack @gchandra10
v0.9.1
- Added quality checker and end-to-end workflows (#519). This release introduces a no-code solution for applying checks. The following workflows were added: quality-checker (apply checks and save results to tables) and end-to-end (e2e) workflows (profile input data, generate quality checks, apply the checks, save results to tables). The workflows enable quality checking for data at-rest without the need for code-level integration. They support reference data for checks using tables (e.g., required by foreign key or compare datasets checks) as well as custom Python check functions (mapping of a custom check function to the module path in the workspace or Unity Catalog volume containing the function definition). The workflows handle one run config for each job run; a future release will introduce functionality to execute this across multiple tables. In addition, CLI commands have been added to execute the workflows. Additionally, DQX workflows are now configured to execute using serverless clusters, with an option to use standard clusters as well. `InstallationChecksStorageHandler` now supports absolute workspace path locations.
- Added built-in row-level check for PII detection (#486). Introduced a new built-in check for Personally Identifiable Information (PII) detection, which utilizes the Presidio framework and can be configured using various parameters, such as NLP entity recognition configuration. This check can be defined using the `does_not_contain_pii` check function and can be customized to suit specific use cases. The check requires the `pii` extras to be installed: `pip install databricks-labs-dqx[pii]`. Furthermore, a new enum class `NLPEngineConfig` has been introduced to define various NLP engine configurations for PII detection. Overall, these updates aim to provide more robust and customizable quality checking capabilities for detecting PII data.
- Added equality row-level checks (#535). Two new row-level checks, `is_equal_to` and `is_not_equal_to`, have been introduced to enable equality checks on column values, allowing users to verify whether the values in a specified column are equal to or not equal to a given value, which can be a numeric literal, column expression, string literal, date literal, or timestamp literal (see the sketch after this list).
- Added demo for Spark Structured Streaming (#518). Added a demo to showcase usage of DQX with Spark Structured Streaming for in-transit data quality checking. The demo is available as a Databricks notebook and can be run on any Databricks workspace.
- Added clarification to profiler summary statistics (#523). Added new section on understanding summary statistics, which explains how these statistics are computed on a sampled subset of the data and provides a reference for the various summary statistics fields.
- Fixed rounding datetimes in the checks generator (#517). The generator has been enhanced to correctly handle midnight values when rounding "up", ensuring that datetime values already at midnight remain unchanged, whereas previously they were rounded to the next day.
- Added API Docs (#520). The DQX API documentation is generated automatically using docstrings. As part of this change the library's documentation has been updated to follow Google style.
- Improved test automation by adding end-to-end test for the asset bundles demo (#533).
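A sketch of the new equality checks (#535) in metadata form; the `value` argument name is an assumption, not confirmed API.

```python
# Hypothetical metadata for the new equality checks; the `value` argument name is an assumption.
checks = [
    {
        "criticality": "error",
        "check": {"function": "is_equal_to", "arguments": {"column": "status", "value": "active"}},
    },
    {
        "criticality": "warn",
        "check": {"function": "is_not_equal_to", "arguments": {"column": "quantity", "value": 0}},
    },
]
```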
BREAKING CHANGES!
`ExtraParams` was moved from the `databricks.labs.dqx.rule` module to `databricks.labs.dqx.config`.
Contributors: @mwojtyczka @ghanse @renardeinside @cornzyblack @bsr-the-mngrm @dinbab1984 @AdityaMandiwal
v0.8.0
- Added new row-level freshness check (#495). A new data quality check function, `is_data_fresh`, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes from a base timestamp column. The function takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, defaulting to the current timestamp if not provided.
- Added new dataset-level freshness check (#499). A new dataset-level check function, `is_data_fresh_per_time_window`, has been added to validate whether at least a specified minimum number of records arrive within every specified time window, ensuring data freshness. This function is customizable, allowing users to define the time window, minimum records per window, and lookback period (see the sketch after this list).
- Improvements have been made to the performance of aggregation check functions, and the check message format has been updated for better readability.
- Created LLM util function to get check functions details (#469). A new utility function has been introduced to provide definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
- Added equality safe row and column matching in compare datasets check (#473). The compare datasets check functionality has been enhanced to handle null values during row matching and column value comparisons, improving its robustness and flexibility. Two new optional parameters, `null_safe_row_matching` and `null_safe_column_value_matching`, have been introduced to control how null values are handled, both defaulting to True. These parameters allow for null-safe primary key matching and column value matching, ensuring accurate comparison results even when null values are present in the data. The check now excludes specific columns from value comparison using the `exclude_columns` parameter while still considering them for row matching.
- Fixed datetime rounding logic in profiler (#483). The datetime rounding logic has been improved in the profiler to respect the `round=False` option, which was previously ignored. The code now handles the `OverflowError` that occurs when rounding up the maximum datetime value by capping the result and logging a warning.
- Added loading and saving checks from file in Unity Catalog Volume (#512). This change introduces support for storing quality checks in a Unity Catalog Volume, in addition to existing storage types such as tables, files, and workspace files. The storage location of quality checks has been unified into a single configuration field called `checks_location`, replacing the previous `checks_file` and `checks_table` fields, to simplify the configuration and remove ambiguity by ensuring only one storage location can be defined per run configuration. The `checks_location` field can point to a file in the local path, workspace, installation folder, or Unity Catalog Volume, providing users with more flexibility and clarity when managing their quality checks.
- Refactored methods for loading and saving checks (#487). The `DQEngine` class has undergone significant changes to improve modularity and maintainability, including the unification of methods for loading and saving checks under the `load_checks` and `save_checks` methods, which take a `config` parameter to determine the storage type, such as `FileChecksStorageConfig`, `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, or `InstallationChecksStorageConfig`.
- Storing checks using DQX classes (#474). The data quality engine has been enhanced with methods to convert quality checks between `DQRule` objects and Python dictionaries, allowing for flexibility in check definition and usage. The `serialize_checks` method converts a list of `DQRule` instances into a dictionary representation, while the `deserialize_checks` method performs the reverse operation, converting a dictionary representation back into a list of `DQRule` instances. Additionally, the `DQRule` class now includes a `to_dict` method to convert a `DQRule` instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes enable users to work with checks in both formats, store and retrieve checks easily, and improve the overall management and storage of data quality checks. The conversion process supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format.
- Added LLM utility function to extract checks examples in YAML from docs (#506). This is achieved through a new Python script that extracts YAML examples from MDX documentation files and creates a combined YAML file with all the extracted examples. The script utilizes regular expressions to extract YAML code blocks from MDX content, validates each YAML block, and combines all valid blocks into a single list. The combined YAML file is then created in the LLM resources directory for use in language model processing.
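A sketch of the new freshness checks (#495, #499) in metadata form; the argument names (`max_age_minutes`, `window_minutes`, `min_records_per_window`) are assumptions based on the parameter descriptions above.

```python
# Hypothetical metadata for the freshness checks; argument names are assumptions.
checks = [
    {
        "criticality": "warn",
        "check": {
            "function": "is_data_fresh",
            "arguments": {"column": "event_ts", "max_age_minutes": 60},
        },
    },
    {
        "criticality": "error",
        "check": {
            "function": "is_data_fresh_per_time_window",
            "arguments": {
                "column": "event_ts",
                "window_minutes": 15,
                "min_records_per_window": 100,
            },
        },
    },
]
```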
BREAKING CHANGES!
- The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.
- The `load_run_config` method has been moved to `config_loader.RunConfigLoader`, as it is not intended for direct use and falls outside the `DQEngine` core responsibilities.
DEPRECATION CHANGES!
If you are loading or saving checks from storage (file, workspace file, table, installation), you are affected. We are deprecating the methods below. We are keeping them in the `DQEngine` for now, but you should update your code as these methods will be removed in future versions.
- Loading checks from storage has been unified under the `load_checks` method. The following `DQEngine` methods have been deprecated: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_installation`, `load_checks_from_table`.
- Saving checks in storage has been unified under the `save_checks` method. The following `DQEngine` methods have been deprecated: `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_installation`, `save_checks_in_table`.
The `save_checks` and `load_checks` methods take a config as a parameter, which determines the storage type used. The following storage configs are currently supported:
- `FileChecksStorageConfig`: file in the local filesystem (YAML or JSON)
- `WorkspaceFileChecksStorageConfig`: file in the workspace (YAML or JSON)
- `TableChecksStorageConfig`: a table
- `InstallationChecksStorageConfig`: storage defined in the installation context, using either the `checks_table` or `checks_file` field from the run configuration
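A minimal sketch of the unified loading and saving flow, assuming the storage config classes live in `databricks.labs.dqx.config` and accept a `location` argument:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import FileChecksStorageConfig, TableChecksStorageConfig

dq_engine = DQEngine(WorkspaceClient())

# Load checks from a local YAML file...
checks = dq_engine.load_checks(config=FileChecksStorageConfig(location="checks.yml"))

# ...and save them to a Delta table for centralized management.
dq_engine.save_checks(checks, config=TableChecksStorageConfig(location="main.dq.checks"))
```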
Contributors: @mwojtyczka, @karthik-ballullaya-db, @bsr-the-mngrm, @ajinkya441, @cornzyblack, @ghanse, @jominjohny, @dinbab1984
v0.7.1
- Added type validation for apply checks method (#465). The library now enforces stricter type validation for data quality rules, ensuring all elements in the checks list are instances of `DQRule`. If invalid types are encountered, a `TypeError` is raised with a descriptive error message, suggesting alternative methods for passing checks as dictionaries. Additionally, input attribute validation has been enhanced to verify the criticality value, which must be either `warn` or `error`, and raises a `ValueError` for invalid values.
- Databricks Asset Bundle (DAB) demo (#443). A new demo showcasing the usage of DQX with DAB has been added.
- Check to compare datasets (#463). A new dataset-level check, `compare_datasets`, has been introduced to compare two DataFrames at both row and column levels, providing detailed information about differences, including new or missing rows and column-level changes. This check compares only columns present in both DataFrames, excludes map type columns, and can be customized to exclude specific columns or perform a FULL OUTER JOIN to identify missing records. The `compare_datasets` check can be used with a reference DataFrame or table name, and its results include information about missing and extra rows, as well as a map of changed columns and their differences (see the sketch after this list).
- Demo on how to use DQX with dbt projects (#460). A new demo has been added to showcase how to use DQX with the dbt transformation framework.
- IP V4 address validation (#464). The library has been enhanced with new checks to validate IPv4 addresses. Two new row checks, `is_valid_ipv4_address` and `is_ipv4_address_in_cidr`, have been introduced to verify whether values in a specified column are valid IPv4 addresses and whether they fall within a given CIDR block, respectively.
- Improved loading checks from Delta table (#462). Loading checks from Delta tables has been improved to eliminate the need to escape string arguments, providing a more robust and user-friendly experience for working with quality checks defined in Delta tables.
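A sketch of the `compare_datasets` check (#463) in metadata form; all of the argument names shown (`columns`, `ref_table`, `exclude_columns`) are assumptions based on the description above, not confirmed API.

```python
# Hypothetical metadata for the compare_datasets check; argument names are assumptions.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "compare_datasets",
            "arguments": {
                "columns": ["order_id"],                 # columns used to match rows
                "ref_table": "main.dq.orders_expected",  # reference table to compare against
                "exclude_columns": ["load_ts"],          # excluded from value comparison
            },
        },
    },
]
```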
Contributors: @mwojtyczka, @cornzyblack, @ghanse, @grusin-db
v0.7.0
- Added end-to-end quality checking methods (#364). The library now includes end-to-end quality checking methods, allowing users to read data from a table or view, apply checks, and write the results to a table. The `DQEngine` class has been updated to utilize `InputConfig` and `OutputConfig` objects to handle input and output configurations, providing more flexibility in the quality checking flow. The `apply_checks_and_write_to_table` and `apply_checks_by_metadata_and_write_to_table` methods have been introduced to support this functionality, applying checks using DQX classes and configuration, respectively. Additionally, the profiler configuration options have been reorganized into `input_config` and `profiler_config` sections, making it easier to understand and customize the profiling process. The changes aim to provide a more streamlined and efficient way to perform end-to-end quality checking and data validation, with improved configuration flexibility and readability (see the sketch after this list).
- Added equality checks for aggregate values and negate option for Foreign Key (#387). The library now includes two new checks, `is_aggr_equal` and `is_aggr_not_equal`, which enable users to perform equality checks on aggregate values, such as count, sum, average, minimum, and maximum, allowing verification that an aggregation on a column or group of columns is equal to or not equal to a specified limit. These checks can be configured with a criticality level of either `error` or `warn` and can be applied to specific columns or groups of columns. Additionally, the `foreign_key` check has been updated with a `negate` option, allowing the condition to be negated so that the check fails when the foreign key values exist in the reference dataframe or table, rather than when they do not exist. This expanded functionality enhances the library's data quality checking capabilities, providing more flexibility and power in validating data integrity.
- Extend options for profiling multiple tables (#420). The profiler now supports wildcard patterns for profiling multiple tables, replacing the previous regex pattern support, and options can be passed as a list of dictionaries to apply different options to each table based on pattern matching. The profiler job is now set up with an IO-cache-enabled cluster.
- Improved quick demo to showcase defining checks using DQX classes and renamed DLT to Lakeflow Pipeline in docs (#399). `DLT` has been renamed to `Lakeflow Pipeline` in documentation and docstrings to maintain consistency in terminology. The quick demo has been enhanced to showcase defining checks using DQX classes, providing a more comprehensive approach to data quality validation. Additionally, performance information related to dataset-level checks has been added to the documentation, and instructions on how to use the Environment to install DQX in Lakeflow Pipelines have been provided.
- Populate columns in the results from kwargs of the check if provided (#416). Additionally, the `sql_expression` check now supports an optional `columns` argument that is propagated to the results.
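A minimal end-to-end sketch for #364, assuming `InputConfig`/`OutputConfig` live in `databricks.labs.dqx.config` with `location`, `format`, and `mode` fields, and that the method accepts `checks`, `input_config`, and `output_config` arguments:

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import InputConfig, OutputConfig

dq_engine = DQEngine(WorkspaceClient())

# Checks defined in the metadata (dictionary) format.
checks = [
    {"criticality": "error", "check": {"function": "is_not_null", "arguments": {"column": "device_id"}}},
]

# Read from the input table, apply the checks, and write the annotated results.
dq_engine.apply_checks_by_metadata_and_write_to_table(
    checks=checks,
    input_config=InputConfig(location="main.iot.bronze", format="delta"),
    output_config=OutputConfig(location="main.iot.silver", mode="append"),
)
```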
BREAKING CHANGES!
- Only users using `config.yml` are affected. The breaking change is for the required field `output_location`, which is now stored inside `output_config.location`.
- The structure of the config.yml has been improved to include configs for input, output, and quarantine data instead of specifying the fields directly. New fields related to the profiler cluster and streaming have also been included. The installer has been updated accordingly.
Full example:
log_level: INFO
version: 1
profiler_override_clusters: # <- optional dictionary mapping job cluster names to existing cluster IDs
  main: your-existing-cluster-id # <- existing cluster ID to use
profiler_spark_conf: # <- optional spark configuration to use for the profiler job
  spark.sql.ansi.enabled: true
run_configs:
- name: default # <- unique name of the run config (default used during installation)
  input_config: # <- optional input data configuration
    location: s3://iot-ingest/raw # <- input location of the data (table or cloud path)
    format: delta # <- format, required if cloud path provided
    is_streaming: false # <- whether the input data should be read using streaming (default is false)
    schema: col1 int, col2 string # <- schema of the input data (optional), applicable if reading csv and json files
    options: # <- additional options for reading from the input location (optional)
      versionAsOf: '0'
  output_config: # <- output data configuration
    location: main.iot.silver # <- output location (table), used as input for the quality dashboard if quarantine location is not provided
    format: delta # <- format of the output table
    mode: append # <- write mode for the output table (append or overwrite)
    options: # <- additional options for writing to the output table (optional)
      mergeSchema: 'true'
      #checkpointLocation: /Volumes/catalog1/schema1/checkpoint # <- only applicable if input_config.is_streaming is enabled
    trigger: # <- streaming trigger, only applicable if input_config.is_streaming is enabled
      availableNow: true
  quarantine_config: # <- quarantine data configuration, if specified, bad data is written to quarantine table
    location: main.iot.silver_quarantine # <- quarantine location (table), used as input for the quality dashboard
    format: delta # <- format of the quarantine table
    mode: append # <- write mode for the quarantine table (append or overwrite)
    options: # <- additional options for writing to the quarantine table (optional)
      mergeSchema: 'true'
      #checkpointLocation: /Volumes/catalog1/schema1/checkpoint # <- only applicable if input_config.is_streaming is enabled
    trigger: # <- streaming trigger, only applicable if input_config.is_streaming is enabled
      availableNow: true
  checks_file: iot_checks.yml # <- relative location of the quality rules (checks) defined in json or yaml file
  checks_table: main.iot.checks # <- table storing the quality rules (checks)
  profiler_config: # <- profiler configuration
    summary_stats_file: iot_summary_stats.yml # <- relative location of profiling summary stats
    sample_fraction: 0.3 # <- fraction of data to sample in the profiler (30%)
    sample_seed: 30 # <- optional seed for reproducible sampling
    limit: 1000 # <- limit the number of records to profile
  warehouse_id: your-warehouse-id # <- warehouse id for refreshing dashboard
- name: another_run_config # <- unique name of the run config
  ...
Contributors: @mwojtyczka, @ghanse
v0.6.0
Release v0.6.0 (#395)
- Added Dataset-level checks, Foreign Key and SQL Script checks (#375). The data quality library has been enhanced with the introduction of dataset-level checks, which allow users to apply quality checks at the dataset level, in addition to existing row-level checks. Similar to row-level checks, the results of the dataset-level quality checks are reported for each individual row in the result columns. A new `DQDatasetRule` class has been added to define dataset-level checks, and several new check functions have been added, including the `foreign_key` and `sql_query` dataset-level checks. The library now also supports custom dataset-level checks using arbitrary SQL queries and provides the ability to define checks on multiple DataFrames or Tables. The `DQEngine` class has been modified to optionally accept a Spark session as a parameter in its constructor, allowing users to pass their own Spark session. Major internal refactoring has been carried out to improve code maintenance and structure.
- Added Demo from Data and AI Summit 2025 - DQX Demo for Manufacturing Industry (#391).
- Added Github Action to check if all commits are signed (#392). The library now includes a GitHub Action that automates the verification of signed commits in pull requests, enhancing the security and integrity of the codebase. This action checks each commit in a pull request to ensure it is signed using the `git commit -S` command, and if any unsigned commits are found, it posts a comment with instructions on how to properly sign commits.
- Added methods to profile multiple tables (#374). The data profiling feature has been significantly enhanced with the introduction of two new methods, `profile_table` and `profile_tables`, which enable direct profiling of Delta tables, allowing users to generate summary statistics and candidate data quality rules. These methods provide a convenient way to profile data stored in Delta tables, with `profile_table` generating a profile from a single Delta table and `profile_tables` generating profiles from multiple Delta tables using explicit table lists or regex patterns for inclusion and exclusion. The profiling process is highly customizable, supporting extensive configuration options such as sampling, outlier detection, null value handling, and string handling. The generated profiles can be used to create Delta Live Tables expectations for enforcing data quality rules, and the profiling results can be stored in a table or file as YAML or JSON for easy management and reuse (see the sketch after this list).
- Created a quick start demo (#367). A new demo notebook has been introduced to provide a quickstart guide for utilizing the library, enabling users to easily test features using the Databricks Power Tools and Databricks Extension in VS Code. This demo notebook showcases both configuration styles side by side, applying the same rules for direct comparison, and includes a small, hardcoded sample dataset for quick experimentation, designed to be executed cell-by-cell in VS Code. The addition of this demo aims to help new users understand and compare both configuration approaches in a practical context, facilitating a smoother onboarding experience.
- Pin GitHub URLs in docs to the latest released version (#390). The formatting process now includes an additional step to update GitHub URLs, ensuring they point to the latest released version instead of the main branch, which helps prevent access to unreleased changes. This update is automated during the release process and allows users to review changes before committing.
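A minimal profiling sketch for #374; the `DQProfiler`/`DQGenerator` import paths and the exact `profile_table` signature are assumptions based on the description above.

```python
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.profiler.profiler import DQProfiler    # assumed module path
from databricks.labs.dqx.profiler.generator import DQGenerator  # assumed module path

ws = WorkspaceClient()

# Profile a single Delta table; returns summary statistics and column profiles.
summary_stats, profiles = DQProfiler(ws).profile_table("main.iot.bronze")

# Turn the profiles into candidate data quality rules.
checks = DQGenerator(ws).generate_dq_rules(profiles)
```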
BREAKING CHANGES!
- Moved existing `is_unique`, `is_aggr_not_greater_than`, and `is_aggr_not_less_than` checks under the dataset-level checks umbrella. These checks must now be defined using the `DQDatasetRule` class and not `DQRowRule`. Input parameters remain the same as before. This is a breaking change for checks defined using DQX classes; YAML/JSON definitions are not affected.
- `DQRowRuleForEachCol` has been renamed to `DQForEachColRule` to make it generic and able to handle both row- and dataset-level rules.
- Renamed `column_names` to `result_column_names` in the `ExtraParams` for clarity, as they may be confused with the column(s) specified for the rules themselves. This is a breaking change!
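A sketch of the migration for class-based checks; the module paths (`databricks.labs.dqx.rule`, `databricks.labs.dqx.check_funcs`) and the exact `DQDatasetRule` fields are assumptions.

```python
from databricks.labs.dqx.rule import DQDatasetRule
from databricks.labs.dqx import check_funcs

# is_unique was previously defined via DQRowRule; it must now use DQDatasetRule.
uniqueness_check = DQDatasetRule(
    criticality="error",
    check_func=check_funcs.is_unique,
    columns=["order_id", "order_line"],  # composite key
)
```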
Contributors: @mwojtyczka @alexott @nehamilak-db @ghanse @Escoto
v0.5.0
What's Changed
- Fix spark remote version detection in CI by @mwojtyczka in #342
- Fix spark remote installation by @mwojtyczka in #346
- Load and save checks from a Delta table by @ghanse in #339
- Handle nulls in uniqueness check for composite keys to conform with the SQL ANSI Standard by @mwojtyczka in #345
Corrected the `is_unique` check to consistently handle nulls. To conform with the SQL ANSI Standard, null values are now treated as unknown, and thus not duplicates, e.g. (NULL, NULL) does not equal (NULL, NULL) and (1, NULL) does not equal (1, NULL). A new parameter called `nulls_distinct` has been added to the check, which is set to `True` by default. If set to `False`, null values are treated as duplicates, e.g. (1, NULL) equals (1, NULL) and (NULL, NULL) equals (NULL, NULL).
Added a `columns` parameter to the `is_unique` check to be able to pass a list of columns to handle composite keys (a sketch follows this list).
- Allow user metadata for individual checks by @ghanse in #352
- Add save_results_in_table function by @karthik-ballullaya-db in #319
- Fix older than by @ghanse in #354
- Add PII-detection example by @ghanse in #358
- Add aggregation type of checks (count/min/max/sum/avg/mean over a group of records) by @mwojtyczka in #357
- Updated documentation to provide guidance on applying quality checks to multiple data sets.
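A sketch of the updated `is_unique` check in metadata form, combining the new `columns` and `nulls_distinct` parameters described above; the `check`/`function`/`arguments` layout follows the YAML examples later in this section, and the `criticality` field is the standard DQX criticality setting.

```python
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "is_unique",
            "arguments": {
                "columns": ["order_id", "order_line"],  # composite key
                "nulls_distinct": False,  # treat (1, NULL) and (1, NULL) as duplicates
            },
        },
    },
]
```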
BREAKING CHANGES AND HOW TO UPGRADE!
This release introduces changes to the syntax of quality rules to ensure we can scale DQX functionality in the future. Please update your quality rules accordingly.
- Renamed the `col_name` field in the checks to `column` to better reflect the fact that it can be a column name or column expression.
From:
- check:
function: is_not_null
arguments:
col_name: col1
To:
- check:
function: is_not_null
arguments:
column: col1
- Similarly, custom checks must define `column` instead of `col_name` as the first argument.
- Renamed and changed the data type of `col_name` to `columns` in the reporting columns. DQX now stores a list of columns instead of a single column to handle checks that take multiple columns as input.
- Renamed `col_names` to `for_each_column` and moved it from the check function arguments to the check level.
From:
- check:
function: is_not_null
arguments:
col_names:
- col1
- col2
To:
- check:
function: is_not_null
for_each_column:
- col1
- col2
- Renamed `DQColRule` to `DQRowRule` for clarity. This only affects checks defined using classes.
- Renamed `DQColSetRule` to `DQRowRuleForEachCol` for clarity. This only affects checks defined using classes.
- Renamed the `row_checks` module containing check functions to `check_funcs`. This only affects checks defined using classes.
The documentation has been updated accordingly: https://databrickslabs.github.io/dqx/docs/reference/quality_rules/
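A sketch of updating class-based checks to the new names; the import paths and the `DQRowRule` fields shown are assumptions.

```python
from databricks.labs.dqx.rule import DQRowRule  # formerly DQColRule
from databricks.labs.dqx import check_funcs     # module formerly named row_checks

rule = DQRowRule(
    criticality="warn",
    check_func=check_funcs.is_not_null,
    column="col1",  # formerly col_name
)
```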
Full Changelog: v0.4.0...v0.5.0
v0.4.0
- Added input spark options and schema for reading from the storage (#312). This commit enhances the data quality framework used for profiling and validating data in a Databricks workspace with new options and functionality for reading data from storage. It allows for the usage of input Spark options and schema, and supports fully qualified Unity Catalog or Hive Metastore table names in the format of `catalog.schema.table` or `schema.table`. Additionally, the code now adds a new dataclass field, `input_schema`, and a new dictionary field, `input_read_options`, to the `RunConfig` class. The documentation is updated with examples of how to use the new functionality.
- Added an example of uniqueness check for composite key (#312).
- Renamed row checks module for more clarity (#314). This change renames the `col_check_functions` module to `row_checks` for clarity and to distinguish it from other types of checks. The `import *` syntax is removed and unused imports are removed from the demo. This change requires updating import statements that reference `col_check_functions` to use the new name `row_checks`. Checks defined using DQX classes require a simple update (see the sketch below).
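The import update implied by the rename, as a sketch (the exact module paths are assumptions):

```python
# Before (<= 0.3.1):
# from databricks.labs.dqx.col_check_functions import is_not_null
# After (0.4.0):
from databricks.labs.dqx.row_checks import is_not_null
```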
Full Changelog: v0.3.1...v0.4.0