[BUGFIX] ExpectColumnDistinctValuesToContainSet with database-pushed comparison #11615
Open
NathanFarmer wants to merge 75 commits into develop
…ushed comparison - Add ColumnDistinctValuesMissingFromSet metric to find missing values in DB - Refactor expectation to use new metric instead of fetching all distinct values - Add Metrics API wrapper class for new metric - Respects result_format and partial_unexpected_count settings
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## develop #11615 +/- ##
===========================================
+ Coverage 84.19% 84.21% +0.02%
===========================================
Files 471 471
Lines 39591 39591
===========================================
+ Hits 33332 33343 +11
+ Misses 6259 6248 -11
- Add ValidationDependencies to TYPE_CHECKING for type annotations - Import inside method for runtime use - Avoids circular import while keeping unquoted type annotation
…rcion - Always fetch column.distinct_values for type coercion and observed_value - Coerce value_set to match column value types before comparison - Return all distinct values in observed_value when successful (backward compatible) - Return value_counts in details when result_format is COMPLETE - Handle type coercion for date/string mismatches
…_count on failure - Cast partial_unexpected_count to int to fix type checker error - Only include unexpected_count in result when there are violations - Filter out nulls from observed values before comparison - Fixes test failures for result format and type checking
- Filter out null/NaN values from observed_value_set before comparison - Ensures nulls are ignored as expected by the expectation - Add PLR0912 noqa comment for complexity
- Check type before converting to int to satisfy type checker - Handle case where partial_unexpected_count might not be int or str
- Test ColumnDistinctValuesMissingFromSet for all data sources - Tests cover: all set values present, some missing, all missing, subset - Tests verify limit parameter works correctly
… metric - Add test_dates_all_present for date objects in value_set - Add test_dates_with_str_value_set for string dates (tests BigQuery DATE coercion) - Uses DATA_SOURCES_THAT_SUPPORT_DATE_COMPARISONS which includes BigQuery
Metrics don't do type coercion - that's an expectation-level feature. The test_dates_all_present test with proper date objects is sufficient for metrics.
for more information, see https://pre-commit.ci
observed_value now contains only missing values (values from value_set not found in column), not all distinct values. This is a breaking change that avoids fetching all distinct values into memory for high-cardinality columns.
…_contain_set - observed_value now contains only missing values, not all distinct values - No more details.value_counts in results - String values are no longer auto-coerced to match date columns
…pected_list - observed_value is now None (semantically correct - it's not observed values) - Missing values go in partial_unexpected_list (limited by partial_unexpected_count) - unexpected_count always included when not BOOLEAN_ONLY - Renderer returns '--' for observed value since it's None
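The result shape described in this commit can be sketched in plain Python. This is only an illustration under stated assumptions: build_result, its parameters, and the default of 20 for partial_unexpected_count are hypothetical names and values chosen for the example, not the code in this PR.

```python
from typing import Any, Dict, List


def build_result(
    missing_values: List[Any],
    missing_count: int,
    result_format: str = "BASIC",
    partial_unexpected_count: int = 20,  # assumed default, for illustration
) -> Dict[str, Any]:
    """Hypothetical helper mirroring the result layout in the commit message."""
    success = missing_count == 0
    if result_format == "BOOLEAN_ONLY":
        # BOOLEAN_ONLY carries no detail beyond the success flag.
        return {"success": success, "result": {}}
    return {
        "success": success,
        "result": {
            # The missing values are not "observed" values, so this stays None.
            "observed_value": None,
            # Always reported when the format is not BOOLEAN_ONLY.
            "unexpected_count": missing_count,
            # Truncated to at most partial_unexpected_count entries.
            "partial_unexpected_list": missing_values[:partial_unexpected_count],
        },
    }
```

Calling build_result(["merged", "draft"], 2, partial_unexpected_count=1) would report a count of 2 while listing only the first missing value, which is the behavior the commit message describes for the truncated list.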
…wargs flow automatically
…ove redundant None handling
…or distinct values Expectations
wookasz approved these changes Feb 4, 2026
wookasz requested changes Feb 4, 2026
wookasz (Contributor) left a comment:
Blocking as a breaking change
Refactor ExpectColumnDistinctValuesToContainSet to use database-pushdown metrics
This PR optimizes ExpectColumnDistinctValuesToContainSet for high-cardinality columns by pushing the set comparison logic to the database instead of fetching all distinct values into Python memory.
We no longer return all distinct values as the observed value: fetching them into memory was bad practice for high-cardinality columns, and the full list is not particularly actionable.
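A minimal sketch of the pushdown idea, using the standard-library sqlite3 module rather than the Great Expectations metric providers. The function name missing_from_column, its parameters, and the SQL shape are assumptions made for illustration; the actual metric in this PR may build its query differently.

```python
import sqlite3
from typing import Iterable, List, Optional


def missing_from_column(
    conn: sqlite3.Connection,
    table: str,
    column: str,
    value_set: Iterable[object],
    limit: Optional[int] = None,
) -> List[object]:
    """Return the members of value_set that never appear in table.column.

    The comparison runs inside the database, so only the missing values
    (at most `limit` of them) travel back to Python.
    """
    values = list(dict.fromkeys(value_set))  # de-duplicate, keep order
    if not values:
        return []
    # One bound parameter per candidate value: (?), (?), ...
    placeholders = ", ".join("(?)" for _ in values)
    # NOTE: table and column are interpolated as identifiers and must come
    # from trusted code, never from user input.
    sql = (
        f"WITH wanted(v) AS (VALUES {placeholders}) "
        f"SELECT v FROM wanted "
        f'WHERE NOT EXISTS (SELECT 1 FROM "{table}" t WHERE t."{column}" = wanted.v) '
        f"ORDER BY v"
    )
    params = list(values)
    if limit is not None:
        sql += " LIMIT ?"
        params.append(limit)
    return [row[0] for row in conn.execute(sql, params)]
```

The work the database does still depends on the column, but the data returned to Python is bounded by the size of value_set (or by limit), not by the column's cardinality. NULLs in the column never match a candidate value, so they are ignored by the comparison.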
Changes:
- _validate uses the column.distinct_values.missing_from_column.count and column.distinct_values.missing_from_column metrics
- Results carry observed_value: None, with missing_count and partial_missing_list at the top level
- The _atomic_diagnostic_observed_value renderer displays missing values with the MISSING state
- MAX_DISTINCT_VALUES (20) limits the size of partial_missing_list, regardless of the global partial_unexpected_count default
- format_results.md adds the partial_missing_list field
- great_expectations/metrics/__init__.py: public API exports
Manual Test
Expectation
Test Script Output