v0.8.0
- Added new row-level freshness check (#495). A new data quality check function, `is_data_fresh`, has been introduced to identify stale data resulting from delayed pipelines, enabling early detection of upstream issues. This function assesses whether the values in a specified timestamp column are within a specified number of minutes of a base timestamp column. It takes three parameters: the column to check, the maximum age in minutes before data is considered stale, and an optional base timestamp column, which defaults to the current timestamp if not provided.
- Added new dataset-level freshness check (#499). A new dataset-level check function, `is_data_fresh_per_time_window`, has been added to validate that at least a specified minimum number of records arrives within every specified time window, ensuring data freshness. This function is customizable, allowing users to define the time window, the minimum number of records per window, and the lookback period. A sketch of both freshness checks follows below.
- Improved the performance of aggregation check functions and updated the check message format for better readability.
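A minimal sketch of how the two freshness checks might be declared in metadata (dict) form. The schema follows DQX's metadata format, but the argument names are assumptions inferred from the descriptions above, not the confirmed API:

```python
# Illustrative only: argument names are assumptions inferred from the release
# notes, not the confirmed DQX API.
checks = [
    {
        "criticality": "warn",
        "check": {
            "function": "is_data_fresh",
            "arguments": {
                "column": "event_ts",      # timestamp column to validate
                "max_age_minutes": 60,     # rows older than this are flagged as stale
                # "base_column": "ingest_ts",  # optional; defaults to the current timestamp
            },
        },
    },
    {
        "criticality": "error",
        "check": {
            "function": "is_data_fresh_per_time_window",
            "arguments": {
                "column": "event_ts",           # timestamp column to window on
                "window_minutes": 15,           # size of each time window
                "min_records_per_window": 100,  # minimum records expected per window
                "lookback_windows": 24,         # how far back to validate
            },
        },
    },
]
```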
- Created LLM util function to get check function details (#469). A new utility function has been introduced that provides the definitions of all check functions, enabling the generation of prompts for Large Language Models (LLMs) to create check functions.
- Added equality-safe row and column matching in the compare datasets check (#473). The compare datasets check has been enhanced to handle null values during row matching and column value comparisons, improving its robustness and flexibility. Two new optional parameters, `null_safe_row_matching` and `null_safe_column_value_matching`, have been introduced to control how null values are handled; both default to `True`. These parameters allow for null-safe primary key matching and column value matching, ensuring accurate comparison results even when null values are present in the data. The check can also exclude specific columns from value comparison using the `exclude_columns` parameter while still considering them for row matching (see the sketch below).
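A sketch of a null-safe dataset comparison in metadata form. The `compare_datasets` function name and the reference-table argument are assumptions; the null-safety and exclusion arguments come from the notes above:

```python
# "compare_datasets" and "ref_table" are assumed names; the null-safety and
# exclusion arguments are taken from the release notes.
checks = [
    {
        "criticality": "error",
        "check": {
            "function": "compare_datasets",
            "arguments": {
                "ref_table": "main.default.orders_snapshot",  # assumed reference-dataset argument
                "null_safe_row_matching": True,           # null-safe primary key matching (default True)
                "null_safe_column_value_matching": True,  # null-safe value comparison (default True)
                "exclude_columns": ["load_ts"],  # excluded from value comparison, still used for row matching
            },
        },
    },
]
```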
- Fixed datetime rounding logic in profiler (#483). The datetime rounding logic in the profiler now respects the `round=False` option, which was previously ignored. The code also handles the `OverflowError` that occurs when rounding up the maximum datetime value by capping the result and logging a warning.
- Added loading and saving checks from a file in a Unity Catalog Volume (#512). This change introduces support for storing quality checks in a Unity Catalog Volume, in addition to existing storage types such as tables, files, and workspace files. The storage location of quality checks has been unified into a single configuration field called `checks_location`, replacing the previous `checks_file` and `checks_table` fields; this simplifies the configuration and removes ambiguity by ensuring only one storage location can be defined per run configuration. The `checks_location` field can point to a file in a local path, the workspace, the installation folder, or a Unity Catalog Volume, giving users more flexibility and clarity when managing their quality checks (example values below).
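Some illustrative values for the unified `checks_location` field; the paths are examples only and do not come from the release notes:

```python
# Example checks_location values (paths are illustrative).
local_file = "checks.yml"                                      # file in a local path
workspace_file = "/Workspace/Users/me@example.com/checks.yml"  # workspace file
volume_file = "/Volumes/main/default/dqx/checks.yml"           # Unity Catalog Volume file
```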
- Refactored methods for loading and saving checks (#487). The `DQEngine` class has undergone significant changes to improve modularity and maintainability, including the unification of methods for loading and saving checks under the `load_checks` and `save_checks` methods. Both take a `config` parameter that determines the storage type, such as `FileChecksStorageConfig`, `WorkspaceFileChecksStorageConfig`, `TableChecksStorageConfig`, or `InstallationChecksStorageConfig` (see the round-trip sketch below).
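A minimal round-trip sketch of the unified API. The method, class, and config names come from the notes; the import paths and the `location` field name are assumptions:

```python
# Import paths and the 'location' field name are assumptions; method and
# config class names come from the release notes.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import FileChecksStorageConfig

dq_engine = DQEngine(WorkspaceClient())

# Metadata-style checks to persist.
checks = [
    {"criticality": "error", "check": {"function": "is_not_null", "arguments": {"column": "order_id"}}},
]

# Save to a local YAML file, then load back using the same config type.
config = FileChecksStorageConfig(location="checks.yml")
dq_engine.save_checks(checks, config=config)
checks = dq_engine.load_checks(config=config)
```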
- Storing checks using dqx classes (#474). The data quality engine has been enhanced with methods to convert quality checks between `DQRule` objects and Python dictionaries, allowing for flexibility in check definition and usage. The `serialize_checks` method converts a list of `DQRule` instances into a dictionary representation, while the `deserialize_checks` method performs the reverse operation, converting a dictionary representation back into a list of `DQRule` instances. Additionally, the `DQRule` class now includes a `to_dict` method to convert a `DQRule` instance into a structured dictionary, providing a standardized representation of the rule's metadata. These changes let users work with checks in both formats, store and retrieve checks easily, and improve the overall management and storage of data quality checks. The conversion process supports local execution and handles non-complex column expressions, although complex PySpark expressions or Python functions may not be fully reconstructable when converting from class to metadata format (a conversion sketch follows the next item).
- Added LLM utility function to extract checks examples in YAML from docs (#506). This is achieved through a new Python script that extracts YAML examples from MDX documentation files and creates a combined YAML file with all the extracted examples. The script uses regular expressions to extract YAML code blocks from MDX content, validates each YAML block, and combines all valid blocks into a single list. The combined YAML file is then created in the LLM resources directory for use in language model processing.
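A sketch of the class-to-metadata round trip. The `serialize_checks`, `deserialize_checks`, and `to_dict` names come from the notes; the import paths and the `DQRowRule` arguments are assumptions:

```python
# Import paths and rule arguments are assumptions; serialize_checks,
# deserialize_checks, and to_dict are named in the release notes.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.rule import DQRowRule
from databricks.labs.dqx import check_funcs

dq_engine = DQEngine(WorkspaceClient())

rule = DQRowRule(criticality="error", check_func=check_funcs.is_not_null, column="order_id")

metadata = dq_engine.serialize_checks([rule])   # list of DQRule -> dictionary representation
rules = dq_engine.deserialize_checks(metadata)  # dictionary representation -> list of DQRule
print(rule.to_dict())                           # standardized dict for a single rule
```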
BREAKING CHANGES!
- The `checks_file` and `checks_table` fields have been removed from the installation run configuration. They are now consolidated into the single `checks_location` field. This change simplifies the configuration and clearly defines where checks are stored.
- The `load_run_config` method has been moved to `config_loader.RunConfigLoader`, as it is not intended for direct use and falls outside the `DQEngine` core responsibilities.
DEPRECATION CHANGES!
If you are loading or saving checks from storage (file, workspace file, table, installation), you are affected. The methods below are deprecated. They remain available in the `DQEngine` for now, but you should update your code, as they will be removed in future versions.
- Loading checks from storage has been unified under the `load_checks` method. The following methods of the `DQEngine` have been deprecated: `load_checks_from_local_file`, `load_checks_from_workspace_file`, `load_checks_from_installation`, `load_checks_from_table`.
- Saving checks in storage has been unified under the `save_checks` method. The following methods of the `DQEngine` have been deprecated: `save_checks_in_local_file`, `save_checks_in_workspace_file`, `save_checks_in_installation`, `save_checks_in_table`.
The `save_checks` and `load_checks` methods take a `config` parameter, which determines the storage type used. The following storage configs are currently supported (see the migration sketch after the list):
- `FileChecksStorageConfig`: file in the local filesystem (YAML or JSON)
- `WorkspaceFileChecksStorageConfig`: file in the workspace (YAML or JSON)
- `TableChecksStorageConfig`: a table
- `InstallationChecksStorageConfig`: storage defined in the installation context, using the `checks_location` field from the run configuration
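A hypothetical migration sketch from one of the deprecated methods to the unified API; the import paths and the `location` field name are assumptions:

```python
# Import paths and the 'location' field name are assumptions; method and
# config class names come from the release notes.
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import WorkspaceFileChecksStorageConfig

dq_engine = DQEngine(WorkspaceClient())

# Before (deprecated):
# checks = dq_engine.load_checks_from_workspace_file("/Workspace/Users/me@example.com/checks.yml")

# After (unified API):
checks = dq_engine.load_checks(
    config=WorkspaceFileChecksStorageConfig(location="/Workspace/Users/me@example.com/checks.yml")
)
```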
Contributors: @mwojtyczka, @karthik-ballullaya-db, @bsr-the-mngrm, @ajinkya441, @cornzyblack, @ghanse, @jominjohny, @dinbab1984