Description
Is there an existing issue for this?
- I have searched the existing issues
Problem statement
We frequently encounter data quality errors which signify that an entire subgroup of the data cannot be relied upon. As an example, certain sensor readings indicate that the test step containing the erroneous reading shouldn't be trusted.
For these types of use cases, it would be ideal to be able to set a severity level for the check and specify a list of columns by which to group the data. If an error is encountered, all the rows in the affected subgroup are marked/quarantined.
Proposed Solution
- Addition of a new optional parameter for the DQEngine to specify the list of columns for grouping.
- Addition of a new boolean field to the DQ rules that lets users specify whether an error is severe.
If grouping columns are provided and the DQ rule is set to severe, any detected error would result in the whole subgroup being quarantined or marked.
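To make the intended behaviour concrete, here is a minimal sketch in plain Python. It is not the DQX API: the function `quarantine_groups`, the rule dictionary keys (`name`, `check`, `severe`), and the sample column names are all hypothetical and only illustrate the semantics described above.

```python
def quarantine_groups(rows, group_cols, rules):
    """Split rows into (good, quarantined).

    Hypothetical helper, not part of DQX. Each rule is a dict with a
    "check" callable (returns True when the row passes) and a boolean
    "severe" flag. A row is quarantined if it fails any rule itself, or
    if any row in its group fails a rule flagged as severe.
    """
    def group_key(row):
        return tuple(row[c] for c in group_cols)

    bad_groups = set()
    failures = []
    # First pass: evaluate every rule per row and remember which groups
    # contain at least one severe failure.
    for row in rows:
        failed = [r for r in rules if not r["check"](row)]
        failures.append(failed)
        # Escalate to the group only when grouping columns were provided.
        if group_cols and any(r["severe"] for r in failed):
            bad_groups.add(group_key(row))

    # Second pass: route each row to the good or quarantined output.
    good, quarantined = [], []
    for row, failed in zip(rows, failures):
        in_bad_group = bool(group_cols) and group_key(row) in bad_groups
        (quarantined if failed or in_bad_group else good).append(row)
    return good, quarantined


# Assumed sample data: one bad sensor reading in test T1, step 1.
rows = [
    {"test_id": "T1", "step": 1, "temp": 20.0},
    {"test_id": "T1", "step": 1, "temp": -999.0},
    {"test_id": "T1", "step": 2, "temp": 21.5},
    {"test_id": "T2", "step": 1, "temp": 19.0},
]
rules = [
    {
        "name": "temp_in_range",
        "check": lambda r: -50.0 <= r["temp"] <= 150.0,
        "severe": True,
    }
]

good, quarantined = quarantine_groups(rows, ["test_id", "step"], rules)
```

With this sample data, both rows of test `T1`, step 1 end up quarantined, even though only one of them carries the bad reading, while the remaining two rows pass. Without grouping columns, only the failing row itself would be quarantined.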
Additional Context
We've built an in-house data quality framework that provides this functionality. If you start accepting external contributors, please notify me and I would be happy to try to contribute the implementation to your library.