v0.6.0
Release v0.6.0 (#395)
- Added Dataset-level checks, Foreign Key and SQL Script checks (#375). The data quality library now supports dataset-level checks, which apply quality checks at the dataset level in addition to the existing row-level checks. As with row-level checks, the results of dataset-level checks are reported for each individual row in the result columns. A new `DQDatasetRule` class has been added to define dataset-level checks, and several new check functions have been added, including the `foreign_key` and `sql_query` dataset-level checks. The library also supports custom dataset-level checks using arbitrary SQL queries and can define checks on multiple DataFrames or Tables. The `DQEngine` class constructor now optionally accepts a Spark session, allowing users to pass their own. Major internal refactoring has been carried out to improve code structure and maintainability.
- Added Demo from Data and AI Summit 2025 - DQX Demo for Manufacturing Industry (#391).
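For illustration, a dataset-level `foreign_key` check might be declared in YAML roughly as follows. This is a hedged sketch only: the field and argument names below are assumptions, not the library's documented schema.

```yaml
# Hypothetical sketch of a dataset-level check definition
# (argument names are assumptions; consult the DQX docs for the real schema)
- criticality: error
  check:
    function: foreign_key
    arguments:
      columns: [customer_id]          # assumed parameter name
      ref_table: main.ref.customers   # assumed parameter name
```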
- Added Github Action to check if all commits are signed (#392). The library now includes a GitHub Action that automatically verifies signed commits in pull requests, enhancing the security and integrity of the codebase. This action checks each commit in a pull request to ensure it was signed using the `git commit -S` command; if any unsigned commits are found, it posts a comment with instructions on how to properly sign commits.
- Added methods to profile multiple tables (#374). The data profiling feature has been significantly enhanced with two new methods,
`profile_table` and `profile_tables`, which enable direct profiling of Delta tables, allowing users to generate summary statistics and candidate data quality rules. These methods provide a convenient way to profile data stored in Delta tables: `profile_table` generates a profile from a single Delta table, while `profile_tables` generates profiles from multiple Delta tables selected by explicit table lists or by regex patterns for inclusion and exclusion. The profiling process is highly customizable, supporting configuration options such as sampling, outlier detection, null value handling, and string handling. The generated profiles can be used to create Delta Live Tables expectations for enforcing data quality rules, and the profiling results can be stored in a table or file as YAML or JSON for easy management and reuse.
- Created a quick start demo (#367). A new demo notebook provides a quickstart guide for the library, enabling users to easily test features using the Databricks Power Tools and Databricks Extension in VS Code. The notebook showcases both configuration styles side by side, applying the same rules for direct comparison, and includes a small, hardcoded sample dataset for quick experimentation, designed to be executed cell-by-cell in VS Code. This demo helps new users understand and compare both configuration approaches in a practical context, smoothing the onboarding experience.
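The include/exclude regex selection that `profile_tables` performs over table names can be illustrated with a plain-Python sketch; the helper below is purely illustrative and is not DQX's implementation:

```python
import re

def select_tables(tables, include=None, exclude=None):
    """Filter table names with optional include/exclude regex pattern lists.

    Illustrative only -- not the actual DQX implementation.
    """
    selected = tables
    if include:
        selected = [t for t in selected if any(re.search(p, t) for p in include)]
    if exclude:
        selected = [t for t in selected if not any(re.search(p, t) for p in exclude)]
    return selected

tables = ["main.sales.orders", "main.sales.orders_tmp", "main.hr.people"]
print(select_tables(tables, include=[r"^main\.sales\."], exclude=[r"_tmp$"]))
# -> ['main.sales.orders']
```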
- Pin GitHub URLs in docs to the latest released version (#390). The formatting process now includes a step that rewrites GitHub URLs to point to the latest released version instead of the main branch, which prevents links to unreleased changes. The update runs automatically during the release process, and the resulting changes can be reviewed before they are committed.
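Conceptually, the rewrite replaces branch references with the release tag. A minimal sketch under that assumption (the function below is hypothetical, not the actual release tooling):

```python
def pin_github_urls(text: str, tag: str) -> str:
    """Point GitHub 'main' branch URLs at a release tag (illustrative sketch)."""
    return (
        text.replace("/blob/main/", f"/blob/{tag}/")
            .replace("/tree/main/", f"/tree/{tag}/")
    )

doc = "See https://github.com/databrickslabs/dqx/blob/main/docs/guide.mdx"
print(pin_github_urls(doc, "v0.6.0"))
# -> See https://github.com/databrickslabs/dqx/blob/v0.6.0/docs/guide.mdx
```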
BREAKING CHANGES!
- Moved existing `is_unique`, `is_aggr_not_greater_than` and `is_aggr_not_less_than` checks under the dataset-level checks umbrella. These checks must now be defined using the `DQDatasetRule` class instead of `DQRowRule`. Input parameters remain the same as before. This is a breaking change for checks defined using DQX classes; YAML/JSON definitions are not affected. `DQRowRuleForEachCol` has been renamed to `DQForEachColRule` to make it generic and handle both row- and dataset-level rules.
- Renamed `column_names` to `result_column_names` in the `ExtraParams` for clarity, as they could be confused with the column(s) specified for the rules themselves. This is a breaking change!
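The class migration for the moved checks looks roughly like the pseudocode below; the keyword arguments shown are assumptions based on the class names above (only the class names themselves are confirmed by this release):

```
# Before (pre-0.6.0): is_unique defined as a row-level rule (pseudocode)
DQRowRule(check_func=is_unique, columns=["id"], criticality="error")

# After (0.6.0): the same check as a dataset-level rule; parameters unchanged
DQDatasetRule(check_func=is_unique, columns=["id"], criticality="error")
```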
Contributors: @mwojtyczka @alexott @nehamilak-db @ghanse @Escoto