[BUGFIX] ExpectColumnDistinctValuesToEqualSet with database-pushed comparison #11616
Open
NathanFarmer wants to merge 76 commits into develop
…hed comparison
- Add ColumnDistinctValuesNotEqualSet metric to find bidirectional differences in DB
- Refactor expectation to use new metric instead of fetching all distinct values
- Add Metrics API wrapper class for new metric
- Respects result_format and partial_unexpected_count settings
✅ Deploy Preview for niobium-lead-7998 ready!
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@            Coverage Diff             @@
##           develop   #11616    +/-   ##
=========================================
+ Coverage    84.19%   84.22%   +0.03%
=========================================
  Files          471      471
  Lines        39591    39606     +15
=========================================
+ Hits         33332    33357     +25
+ Misses        6259     6249     -10
- Add ValidationDependencies to TYPE_CHECKING for type annotations
- Import inside method for runtime use
- Avoids circular import while keeping unquoted type annotation

…rcion
- Always fetch column.distinct_values for type coercion and observed_value
- Coerce value_set to match column value types before comparison
- Return all distinct values in observed_value when successful (backward compatible)
- Return value_counts in details when result_format is COMPLETE
- Handle type coercion for date/string mismatches
- Fix TypeError when sorting mixed types by catching exception

…unt on failure, filter nulls
- Cast partial_unexpected_count to int() to fix type checker error
- Only include unexpected_count in result when there are violations
- Filter out null/NaN values from observed_value_set before comparison
- Add PLR0915 noqa comment for complexity
- Fixes test failures for result format, type checking, and null handling

- Check type before converting to int to satisfy type checker
- Handle case where partial_unexpected_count might not be int or str

- Test ColumnDistinctValuesNotEqualSet for all data sources
- Tests cover: sets equal, column has extra, set has extra, both have extra, completely different
- Tests verify limit parameter works correctly

…tric
- Add test_dates_equal for date objects in value_set
- Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion)
- Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery
Metrics don't do type coercion - that's an expectation-level feature. The test_dates_equal test with proper date objects is sufficient for metrics.
for more information, see https://pre-commit.ci
observed_value now contains only differences (values in column but not in set, or values in set but not in column), not all distinct values. This is a breaking change that avoids fetching all distinct values into memory for high-cardinality columns. Also includes details.in_column_not_in_set and details.in_set_not_in_column when the expectation fails.
…_equal_set
- observed_value now contains only differences, not all distinct values
- details now contains in_column_not_in_set and in_set_not_in_column
- No more details.value_counts in results
- String values are no longer auto-coerced to match date columns
…tial_unexpected_list to top-level result for equal_set
…ove redundant None handling
…or distinct values Expectations
wookasz reviewed Feb 4, 2026
wookasz approved these changes Feb 4, 2026
wookasz requested changes Feb 4, 2026
wookasz (Contributor) left a comment:
Blocking as a breaking change
Refactor ExpectColumnDistinctValuesToEqualSet to use database-pushdown metrics

This PR optimizes ExpectColumnDistinctValuesToEqualSet for high-cardinality columns by pushing set comparison logic to the database instead of fetching all distinct values into Python memory.

We no longer return an observed value of all distinct values, because this was bad practice and is not particularly actionable.
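To illustrate the idea of pushing the comparison down, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in database. It is not the Great Expectations implementation: the helper `compare_distinct_values`, the temp table `expected_values`, and the `orders` table are hypothetical names made up for this example. The result keys borrow the field names from this PR's description, and the limit of 20 mirrors `MAX_DISTINCT_VALUES`. It assumes the value set contains no nulls and that table and column names are trusted.

```python
import sqlite3

MAX_VALUES = 20  # mirrors the MAX_DISTINCT_VALUES (20) cap named in this PR


def compare_distinct_values(conn, table, column, value_set, limit=MAX_VALUES):
    """Compare a column's distinct values with value_set inside the database.

    Hypothetical helper for illustration only. Identifiers are interpolated
    into SQL, so table and column must come from a trusted source.
    """
    cur = conn.cursor()
    # Stage the expected values in a temp table so both differences are plain SQL.
    cur.execute("DROP TABLE IF EXISTS temp.expected_values")
    cur.execute("CREATE TEMP TABLE expected_values (v)")
    cur.executemany(
        "INSERT INTO expected_values (v) VALUES (?)",
        [(v,) for v in set(value_set)],
    )

    # Distinct column values that are absent from the set (nulls ignored).
    unexpected_sql = (
        f'SELECT DISTINCT "{column}" FROM "{table}" '
        f'WHERE "{column}" IS NOT NULL '
        f'AND "{column}" NOT IN (SELECT v FROM expected_values)'
    )
    # Set values that never occur in the column.
    missing_sql = (
        "SELECT v FROM expected_values WHERE v NOT IN "
        f'(SELECT "{column}" FROM "{table}" WHERE "{column}" IS NOT NULL)'
    )

    def count(sql):
        # Only a single integer crosses the wire, whatever the cardinality.
        return cur.execute(f"SELECT COUNT(*) FROM ({sql})").fetchone()[0]

    def sample(sql):
        # At most `limit` example values are fetched into Python.
        rows = cur.execute(f"{sql} ORDER BY 1 LIMIT ?", (limit,)).fetchall()
        return [row[0] for row in rows]

    unexpected_count = count(unexpected_sql)
    missing_count = count(missing_sql)
    return {
        "success": unexpected_count == 0 and missing_count == 0,
        "unexpected_count": unexpected_count,
        "partial_unexpected_list": sample(unexpected_sql),
        "missing_count": missing_count,
        "partial_missing_list": sample(missing_sql),
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?)",
    [("new",), ("shipped",), ("lost",), (None,), ("new",)],
)
result = compare_distinct_values(conn, "orders", "status", {"new", "shipped", "returned"})
print(result)
```

The point of the sketch is that Python only ever receives two counts and two capped lists, so memory use stays bounded even when the column has millions of distinct values.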
Changes:
- _validate to use column.distinct_values.not_in_set.count, column.distinct_values.not_in_set, column.distinct_values.missing_from_column.count, and column.distinct_values.missing_from_column metrics
- observed_value: None with unexpected_count, partial_unexpected_list, missing_count, and partial_missing_list at the top level
- _atomic_diagnostic_observed_value renderer to display unexpected values with UNEXPECTED state and missing values with MISSING state
- MAX_DISTINCT_VALUES (20) to limit partial_unexpected_list and partial_missing_list size, regardless of the global partial_unexpected_count default
- format_results.md to add partial_missing_list field
- great_expectations/metrics/__init__.py public API exports

Manual Test
Expectation
Test Script Output