Feature to return the processed keys during a data quality validation #11010

@jmcorreia

Description

We have a use case in which we want to store the status of all open and closed Data Quality problems.
To do so, we rely on the GX unexpected_index_list to get the PKs that have DQ problems, and we have custom logic in our framework to collect all the unique keys of the dataset being validated, which we call processed_keys.
Afterwards, we match these against the unexpected_index_list across the full history of runs, and if previously identified bad PKs appear as good records in later runs, we close the Data Quality problem.
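
To make the matching step concrete, here is a minimal sketch of how one run could update the set of open problems. It is only an illustration of the idea described above, not our actual framework code: the function name `update_open_problems` is hypothetical, and it assumes the keys from unexpected_index_list and processed_keys have already been flattened into hashable values (for example tuples of PK columns).

```python
from typing import Hashable, Iterable


def update_open_problems(
    open_problems: set[Hashable],
    processed_keys: Iterable[Hashable],
    unexpected_keys: Iterable[Hashable],
) -> tuple[set[Hashable], set[Hashable]]:
    """Return (still_open, closed) after a single validation run.

    Hypothetical helper: keys are assumed to be hashable PK values.
    """
    processed = set(processed_keys)
    unexpected = set(unexpected_keys)

    # A key is "good" in this run if it was processed and did not fail.
    good = processed - unexpected

    # Previously bad keys that now show up as good records get closed.
    closed = open_problems & good

    # Everything else stays open, plus any failures found in this run.
    still_open = (open_problems - closed) | unexpected
    return still_open, closed


# Keys 1 and 2 were open; this run processed 1..4 and keys 2 and 4 failed.
still_open, closed = update_open_problems({1, 2}, [1, 2, 3, 4], [2, 4])
```

Note that a key can only be closed if it is in processed_keys for that run, which is exactly why the set of processed keys has to match the rows the expectation actually evaluated.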

This has been working so far in GX 0.18.8, even though the processing is challenging.
However, when people use row_conditions, this becomes a problem: we can no longer rely on the unique keys of the dataset, because expectations with row_conditions are evaluated against a filtered version of the dataset rather than the entire dataset.

Of course, we could change our logic to build the "processed_keys" for each row_condition, but this would mean additional processing and additional data being stored. For example, as of now we store a single processed_keys dict for the entire dataset being validated, whether it has 1 or 100 expectations. To accommodate this logic, we would have to build as many dicts as there are expectations with row_conditions.
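
For illustration, the workaround on our side would look roughly like the sketch below. Everything in it is an assumption made for the example: the helper name `processed_keys_by_condition` is hypothetical, rows are plain dicts, and each row_condition is represented as a Python predicate rather than a real GX condition string.

```python
from typing import Callable, Hashable, Iterable, Mapping, Optional

Row = Mapping[str, object]


def processed_keys_by_condition(
    rows: Iterable[Row],
    key_column: str,
    conditions: Mapping[str, Optional[Callable[[Row], bool]]],
) -> dict[str, set[Hashable]]:
    """Build one set of processed keys per row condition.

    Hypothetical helper: a condition of None means "no row_condition",
    i.e. the whole dataset is processed.
    """
    materialized = list(rows)  # rows are scanned once per condition
    result: dict[str, set[Hashable]] = {}
    for name, predicate in conditions.items():
        result[name] = {
            row[key_column]
            for row in materialized
            if predicate is None or predicate(row)
        }
    return result


rows = [
    {"id": 1, "country": "PT"},
    {"id": 2, "country": "ES"},
    {"id": 3, "country": "PT"},
]
keys = processed_keys_by_condition(
    rows,
    "id",
    {
        "<no row_condition>": None,
        'country == "PT"': lambda row: row["country"] == "PT",
    },
)
```

In this sketch the stored data grows with the number of distinct conditions, and the dataset has to be filtered again for each of them, which is the extra processing and storage mentioned above.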

IDEA PROPOSED

  • as you already build the list of keys for failures, would it be possible for you to also retrieve the keys for good records or for processed records?
  • or, would you be able to retrieve the keys that are ignored by applying the row_condition?
  • any other idea?

Additional context

  • Please let me know what you think of the idea, whether you have any other ideas, and whether you need more details.
