v0.6.0

@github-actions github-actions released this 26 Jun 12:13
54ec2b3

Release v0.6.0 (#395)

  • Added Dataset-level checks, Foreign Key and SQL Script checks (#375). The library now supports dataset-level checks, which apply quality checks to a dataset as a whole, in addition to the existing row-level checks. As with row-level checks, the results of dataset-level checks are reported for each individual row in the result columns. A new DQDatasetRule class has been added to define dataset-level checks, along with several new check functions, including the foreign_key and sql_query dataset-level checks. The library also supports custom dataset-level checks using arbitrary SQL queries and can define checks across multiple DataFrames or Tables. The DQEngine constructor now optionally accepts a Spark session, allowing users to pass their own. Major internal refactoring has been carried out to improve code maintenance and structure.
  • Added Demo from Data and AI Summit 2025 - DQX Demo for Manufacturing Industry (#391).
  • Added GitHub Action to check if all commits are signed (#392). The library now includes a GitHub Action that automates the verification of signed commits in pull requests, enhancing the security and integrity of the codebase. This action checks each commit in a pull request to ensure it is signed using the git commit -S command, and if any unsigned commits are found, it posts a comment with instructions on how to properly sign commits.
  • Added methods to profile multiple tables (#374). The data profiling feature has been significantly enhanced with the introduction of two new methods, profile_table and profile_tables, which enable direct profiling of Delta tables, allowing users to generate summary statistics and candidate data quality rules. These methods provide a convenient way to profile data stored in Delta tables, with profile_table generating a profile from a single Delta table and profile_tables generating profiles from multiple Delta tables using explicit table lists or regex patterns for inclusion and exclusion. The profiling process is highly customizable, supporting extensive configuration options such as sampling, outlier detection, null value handling, and string handling. The generated profiles can be used to create Delta Live Tables expectations for enforcing data quality rules, and the profiling results can be stored in a table or file as YAML or JSON for easy management and reuse.
  • Created a quick start demo (#367). A new demo notebook has been introduced to provide a quickstart guide for utilizing the library, enabling users to easily test features using the Databricks Power Tools and Databricks Extension in VS Code. This demo notebook showcases both configuration styles side by side, applying the same rules for direct comparison, and includes a small, hardcoded sample dataset for quick experimentation, designed to be executed cell-by-cell in VS Code. The addition of this demo aims to help new users understand and compare both configuration approaches in a practical context, facilitating a smoother onboarding experience.
  • Pin GitHub URLs in docs to the latest released version (#390). The formatting process now includes an additional step to update GitHub URLs, ensuring they point to the latest released version instead of the main branch, which helps prevent access to unreleased changes. This update is automated during the release process and allows users to review changes before committing.
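
To illustrate the dataset-level checks from #375, here is a minimal plain-Python sketch of a foreign-key style check whose verdict depends on a whole reference dataset but is still reported per row. It does not use DQX or Spark: the function name foreign_key_check, the fk_error result key, and the choice not to flag null keys are all assumptions made for this illustration, not the library's API or behavior.

```python
# Plain-Python illustration only; this is NOT the DQX implementation.
def foreign_key_check(rows, ref_rows, column, ref_column):
    """Return a copy of each row with an 'fk_error' entry (hypothetical name)."""
    # The set of valid keys is derived from the whole reference dataset ...
    valid_keys = {ref[ref_column] for ref in ref_rows}
    results = []
    for row in rows:
        value = row[column]
        # ... but the outcome is attached to every individual row.
        # Assumption of this sketch: null keys are not flagged.
        if value is not None and value not in valid_keys:
            error = f"{column}={value!r} not found in reference column {ref_column!r}"
        else:
            error = None
        results.append({**row, "fk_error": error})
    return results


orders = [
    {"order_id": 1, "customer_id": 10},
    {"order_id": 2, "customer_id": 99},   # no matching customer
    {"order_id": 3, "customer_id": None},
]
customers = [{"id": 10}, {"id": 11}]

checked = foreign_key_check(orders, customers, "customer_id", "id")
for row in checked:
    print(row)
```

Only the second order is flagged, yet every input row appears in the output with its own result, which mirrors how the notes describe dataset-level results being reported in the per-row result columns.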

BREAKING CHANGES!

  • Moved the existing is_unique, is_aggr_not_greater_than and is_aggr_not_less_than checks under the dataset-level checks umbrella. These checks must now be defined using the DQDatasetRule class instead of DQRowRule. Input parameters remain the same as before. This is a breaking change for checks defined using DQX classes. YAML/JSON definitions are not affected.
  • DQRowRuleForEachCol has been renamed to DQForEachColRule to make it generic and handle both row and dataset level rules.
  • Renamed column_names to result_column_names in the ExtraParams for clarity, as the old name could be confused with the column(s) specified for the rules themselves. This is a breaking change!
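
For the column_names rename, the following is a rough sketch of a one-off migration helper. It assumes you keep the arguments you pass to ExtraParams in a plain dict; the helper name migrate_extra_params and the sample values are hypothetical and are not part of DQX.

```python
def migrate_extra_params(kwargs):
    """Return a copy of kwargs with 'column_names' renamed to 'result_column_names'.

    Hypothetical helper for illustration; it only touches a plain dict.
    """
    migrated = dict(kwargs)  # leave the caller's dict unchanged
    if "column_names" in migrated:
        if "result_column_names" in migrated:
            # Refuse to guess which of the two values should win.
            raise ValueError("both 'column_names' and 'result_column_names' are set")
        migrated["result_column_names"] = migrated.pop("column_names")
    return migrated


# Sample values are illustrative assumptions, not taken from the release notes.
old_kwargs = {"column_names": {"errors": "dq_errors", "warnings": "dq_warnings"}}
new_kwargs = migrate_extra_params(old_kwargs)
print(new_kwargs)
```

Dicts that already use the new key pass through unchanged, so the helper can be run repeatedly during a gradual migration.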

Contributors: @mwojtyczka @alexott @nehamilak-db @ghanse @Escoto