[BUGFIX] Narrow single-pass column_values.unique on SQLAlchemy (Redshift WLM) by leodrivera · Pull Request #11863 · great-expectations/great_expectations

leodrivera · 2026-05-04T18:55:56Z

[BUGFIX] Narrow single-pass `column_values.unique` on SQLAlchemy (Redshift WLM)

Problem

On column-store backends (observed on Amazon Redshift), expect_column_values_to_be_unique against large wide tables (~120 columns including JSON / SUPER fields) is cancelled by the WLM low_timeout rule:

psycopg2.errors.InternalError_: Query (…) cancelled by WLM abort action of Query Monitoring Rule "low_timeout".
code: 1078

Root cause

The SQLAlchemy implementation of column_values.unique previously expanded into:

SELECT col NOT IN (
  SELECT col FROM source WHERE … GROUP BY col HAVING count(col) > 1
)

Redshift planned the NOT IN anti-join as two full scans of the source table. Wall time on the user's table consistently exceeded the WLM low_timeout threshold.
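
To make the double reference concrete, here is a minimal sketch of the legacy shape run against an in-memory SQLite table. SQLite is only a stand-in for Redshift, the table layout (`id`, `col`) and sample rows are invented for illustration, and the elided `WHERE …` clause is simply left out.

```python
import sqlite3

# Hypothetical four-row table; "b" is the only duplicated value.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, col TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "b"), (4, "c")],
)

# Legacy shape: the outer SELECT reads `source`, and so does the
# duplicate-finding subquery inside NOT IN.
legacy_sql = """
SELECT id, col NOT IN (
    SELECT col FROM source GROUP BY col HAVING count(col) > 1
) AS is_unique
FROM source
ORDER BY id
"""
rows = conn.execute(legacy_sql).fetchall()

# Count how many times the statement names the source table.
source_refs = legacy_sql.count("FROM source")
print(rows, source_refs)
```

The statement names `source` twice, which is the shape Redshift turned into two scans. How a given engine actually plans it depends on that engine; this sketch only shows the query text and its result.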

Attempt history

Attempt 1 — Window-function single-pass (commit `edd207e6`)

Replaced NOT IN (dup_subquery) with col IN (SELECT col FROM (SELECT col, count(*) OVER (PARTITION BY col) AS rc FROM source) WHERE rc > 1). Redshift could now scan the source exactly once.

Result: Failures dropped sharply but did not go to zero. WLM low_timeout still occasionally tripped on the same query.
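
The windowed duplicate detection from Attempt 1 can be sketched as follows. This shows only the inner counting subquery, not the exact statement generated by commit `edd207e6`. SQLite (3.25 or newer, for window functions) stands in for Redshift, and the table, alias `counted`, and rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, col TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "b"), (4, "c")],
)

# Each row is tagged with the size of its partition; rows whose
# partition holds more than one member are duplicates.
windowed_sql = """
SELECT id, col
FROM (
    SELECT id, col, count(*) OVER (PARTITION BY col) AS rc
    FROM source
) AS counted
WHERE rc > 1
ORDER BY id
"""
duplicates = conn.execute(windowed_sql).fetchall()
source_refs = windowed_sql.count("FROM source")
print(duplicates, source_refs)
```

Here the statement names `source` once, so a single scan is enough to classify every row.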

Attempt 2 — Single-pass windowed selectable mirroring `compound_columns.unique` (commit `de58111688`)

Refactored to the same shape used by compound_columns.unique: a column_values.count_per_value function metric materializes a windowed subquery (SELECT <all table columns>, count(*) OVER (PARTITION BY col) AS _num_rows FROM source) and the condition becomes a simple _num_rows < 2 comparison. The MySQL/SingleStore temp-table workaround was removed (no longer needed, because source is referenced once).

Result: After a week of production traffic, only a single WLM cancellation occurred, but the failure mode persisted on the very wide table (~120 columns, several of which are SUPER / JSON / long-VARCHAR).
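
A rough sketch of the Attempt 2 shape, assuming a three-column table in place of the real ~120-column one. SQLite stands in for Redshift, and the `payload` column, alias `counted`, and sample rows are hypothetical. The point is that every source column rides through the windowed subquery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, col TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(1, "a", "{}"), (2, "b", '{"k": 1}'), (3, "b", '{"k": 2}')],
)

# Wide windowed selectable: all source columns plus _num_rows.
# The expectation's condition is `_num_rows < 2`, so unexpected rows
# are those where the condition is false.
wide_sql = """
SELECT *
FROM (
    SELECT source.*, count(*) OVER (PARTITION BY col) AS _num_rows
    FROM source
) AS counted
WHERE NOT (_num_rows < 2)
ORDER BY id
"""
cursor = conn.execute(wide_sql)
carried_columns = [d[0] for d in cursor.description]
unexpected = cursor.fetchall()
print(carried_columns, unexpected)
```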

Diagnosis after Attempt 2

The remaining failure mode is not a double-scan; it is a wide-row window problem.

The windowed subquery projected every source column (the compound_columns.unique pattern requires this so that downstream unexpected_rows / unexpected_index_list paths can read original columns from the same selectable). The Redshift window operator therefore had to carry ~120 columns — including JSON SUPER fields — through partition redistribution and sort. On non-DISTKEY columns this translates into shipping thousands of bytes per row across slices, plus a wide sort that can spill to disk and trip low_timeout.

Confirmed by inspecting the actual SQL captured in the user's error trace: the inner subquery's projection lists every column of the source table.
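
A back-of-envelope estimate shows why row width matters for the window step. All numbers below are illustrative assumptions, not measurements from the user's cluster: 100 million rows, 4,000 bytes per wide row, and 24 bytes per narrow row (target column plus count).

```python
def window_shuffle_bytes(row_count: int, bytes_per_row: int) -> int:
    """Rough volume a window operator must redistribute and sort."""
    return row_count * bytes_per_row


ROWS = 100_000_000        # assumed row count
WIDE_ROW_BYTES = 4_000    # assumed: ~120 columns incl. JSON / SUPER
NARROW_ROW_BYTES = 24     # assumed: target column + _num_rows

wide_bytes = window_shuffle_bytes(ROWS, WIDE_ROW_BYTES)
narrow_bytes = window_shuffle_bytes(ROWS, NARROW_ROW_BYTES)
ratio = wide_bytes / narrow_bytes
print(wide_bytes, narrow_bytes, round(ratio))
```

Under these assumptions the wide window moves roughly two orders of magnitude more data than a narrow one over the same rows. The real factor depends on actual column sizes and compression.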

Attempt 3 (this PR) — Narrow windowed selectable + join-back for row-retrieval

Two changes that together address the wide-row window without re-introducing the double-scan:

1. column_values.count_per_value projects only the target column and _num_rows. The window operator now sorts narrow rows (1 column + the running count) regardless of source-table width. This is the fast path served to unexpected_count, unexpected_values, and unexpected_value_counts.
2. unexpected_rows and unexpected_index_list are overridden to use a narrow GROUP BY col HAVING count(*) >= 2 dup-keys subquery joined back to the source table. These paths are only exercised when the caller requests SUMMARY / COMPLETE result_format; the default BASIC path is unaffected. The dup-keys subquery itself reads only the target column, and the join back to source reads only the columns actually returned to the user.
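
The join-back path for row retrieval can be sketched like this. It is an illustration of the SQL shape only, not the code in column_values_unique.py. SQLite stands in for Redshift, and the table, aliases, and rows (including two NULLs) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, col TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        (1, "a", "{}"),
        (2, "b", '{"k": 1}'),
        (3, "b", '{"k": 2}'),
        (4, None, "{}"),
        (5, None, "{}"),
    ],
)

# Narrow aggregate: reads only the target column and skips NULLs.
dup_keys_sql = """
SELECT col FROM source
WHERE col IS NOT NULL
GROUP BY col
HAVING count(*) >= 2
"""

# Join the duplicate keys back to the source to fetch full rows.
join_back_sql = f"""
SELECT t.*
FROM source AS t
JOIN ({dup_keys_sql}) AS d ON t.col = d.col
ORDER BY t.id
"""
dup_keys = conn.execute(dup_keys_sql).fetchall()
unexpected_rows = conn.execute(join_back_sql).fetchall()
print(dup_keys, unexpected_rows)
```

Only the outer join touches the wide columns, and only for rows whose key is actually duplicated.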

The is_sqlalchemy_metric_selectable registry keeps column_values.unique (the condition metric name) so that the framework's standard auxiliary methods correctly read their FROM clause from the narrow windowed subquery. The previously added column_values.count_per_value entry (Attempt 2) was redundant and is removed.

SQL shape after this PR

| `result_format` path | Generated shape | Source scans | Window row width |
| --- | --- | --- | --- |
| `unexpected_count` (default `BASIC`) | `SELECT SUM(CASE WHEN _num_rows >= 2 THEN 1 ELSE 0 END) FROM (SELECT col, count(*) OVER (PARTITION BY col) AS _num_rows FROM source)` | 1 | 1 column |
| `unexpected_values`, `unexpected_value_counts` | `SELECT col [, COUNT(*)] FROM (… narrow windowed subquery …) WHERE _num_rows >= 2 [GROUP BY col]` | 1 | 1 column |
| `unexpected_rows` (`SUMMARY` / `COMPLETE`) | `SELECT t.* FROM source t JOIN (SELECT col FROM source WHERE col IS NOT NULL GROUP BY col HAVING count(*) >= 2) d ON t.col = d.col` | 2 (narrow agg + join) | n/a |
| `unexpected_index_list` | `SELECT idx_cols, t.col FROM source t JOIN (… narrow dup-keys agg …) d ON t.col = d.col` | 2 | n/a |

Before this PR, every path paid the wide-row window cost. After this PR, the window is always narrow, and the only multi-scan path is row-retrieval which is opt-in (SUMMARY / COMPLETE) and reads narrowly.
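
For a quick sanity check of the two single-scan shapes, the sketch below runs them against SQLite as a stand-in for Redshift. The table, the alias `counted`, and the six sample rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (id INTEGER, col TEXT)")
conn.executemany(
    "INSERT INTO source VALUES (?, ?)",
    [(1, "a"), (2, "b"), (3, "b"), (4, "c"), (5, "c"), (6, "c")],
)

# Narrow windowed subquery: target column plus its partition size.
narrow = (
    "(SELECT col, count(*) OVER (PARTITION BY col) AS _num_rows "
    "FROM source) AS counted"
)

# unexpected_count: number of rows whose value occurs at least twice.
unexpected_count = conn.execute(
    f"SELECT SUM(CASE WHEN _num_rows >= 2 THEN 1 ELSE 0 END) FROM {narrow}"
).fetchone()[0]

# unexpected_value_counts: each duplicated value with its frequency.
value_counts = conn.execute(
    f"SELECT col, COUNT(*) FROM {narrow} "
    "WHERE _num_rows >= 2 GROUP BY col ORDER BY col"
).fetchall()
print(unexpected_count, value_counts)
```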

Files changed

great_expectations/expectations/metrics/column_map_metrics/column_values_unique.py — narrow windowed selectable; subclass override of _register_metric_functions to replace the SQLAlchemy unexpected_rows and unexpected_index_list aux methods.
great_expectations/expectations/metrics/map_metric_provider/is_sqlalchemy_metric_selectable.py — keep column_values.unique registered (Attempt 2 also added column_values.count_per_value; redundant and removed).
tests/expectations/metrics/test_core.py — replaces the PARTITION BY-only regression assertion with two regressions:
1. unexpected_count SQL must contain PARTITION BY, must not contain NOT IN, and must not project any non-target source column through the windowed subquery (regression guard against the wide-row failure mode from Attempt 2).
2. unexpected_rows SQL must use a GROUP BY / HAVING dup-keys subquery joined back to the source.
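
The first regression could be expressed as a string-level check along these lines. The helper `check_unexpected_count_sql` and the sample SQL strings are hypothetical and are not taken from tests/expectations/metrics/test_core.py; they only illustrate the three conditions listed above.

```python
import re
from typing import List, Sequence


def check_unexpected_count_sql(
    sql: str, non_target_columns: Sequence[str]
) -> List[str]:
    """Return the violated conditions; an empty list means the SQL passes."""
    problems: List[str] = []
    if not re.search(r"\bPARTITION\s+BY\b", sql, re.IGNORECASE):
        problems.append("missing PARTITION BY")
    if re.search(r"\bNOT\s+IN\b", sql, re.IGNORECASE):
        problems.append("contains NOT IN")
    for name in non_target_columns:
        # Whole-word match so `id` does not match inside other identifiers.
        if re.search(rf"\b{re.escape(name)}\b", sql, re.IGNORECASE):
            problems.append(f"projects non-target column {name!r}")
    return problems


narrow_sql = (
    "SELECT SUM(CASE WHEN _num_rows >= 2 THEN 1 ELSE 0 END) FROM "
    "(SELECT col, count(*) OVER (PARTITION BY col) AS _num_rows "
    "FROM source) AS counted"
)
# Simulate the Attempt 2 regression by widening the projection.
wide_sql = narrow_sql.replace("SELECT col,", "SELECT id, col, payload,")

narrow_problems = check_unexpected_count_sql(narrow_sql, ["id", "payload"])
wide_problems = check_unexpected_count_sql(wide_sql, ["id", "payload"])
print(narrow_problems, wide_problems)
```

A string check like this is coarse, since it cannot tell where in the statement a column name appears, but it is enough to catch a projection that silently widens again.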

Compatibility / non-goals

Pandas and Spark implementations are unchanged.
MySQL / SingleStore temp-table workaround remains unnecessary (the source table is still referenced exactly once on the dominant unexpected_count path).
No Expectation API surface changes; only the generated SQL changes.
DISTKEY on the target column further reduces shuffle but is no longer required to stay inside low_timeout on the observed table.

Operational note for users hitting this on Redshift

If you are on a wide table and expect_column_values_to_be_unique is still slow after this fix, verify that the target column is the table's DISTKEY (or at least not random-distributed). The narrow window shipped here makes shuffle volume scale with row count rather than row width, but DISTKEY on the partition column avoids the shuffle entirely.

netlify · 2026-05-04T18:56:01Z

👷 Deploy request for niobium-lead-7998 pending review.

Visit the deploys page to approve it

Name	Link
🔨 Latest commit	`23a7a69`

Project only the target column through the windowed subquery instead of every source column. The previous shape carried all source columns (including JSON/SUPER fields on Redshift) through the window operator, which intermittently tripped WLM "low_timeout" on wide tables. "unexpected_rows" and "unexpected_index_list" are overridden to use a narrow GROUP BY/HAVING dup-keys subquery joined back to source, so they no longer require the windowed selectable to carry every source column.


leodrivera marked this pull request as draft May 5, 2026 16:38

leodrivera changed the title ~~[BUGFIX] Use window function in column_values.unique to prevent Redshift WLM timeouts~~ [BUGFIX] Single-pass windowed selectable for column_values.unique to prevent Redshift WLM timeouts May 5, 2026

leodrivera force-pushed the bugfix/column-values-unique-window-function-redshift-wlm branch from 218064f to 112bc85 Compare May 12, 2026 15:49

[pre-commit.ci] auto fixes from pre-commit.com hooks

73aaac4

for more information, see https://pre-commit.ci

leodrivera changed the title ~~[BUGFIX] Single-pass windowed selectable for column_values.unique to prevent Redshift WLM timeouts~~ [BUGFIX] Narrow single-pass column_values.unique on SQLAlchemy (Redshift WLM) May 12, 2026

Merge branch 'great-expectations:develop' into bugfix/column-values-u…

23a7a69

…nique-window-function-redshift-wlm

leodrivera wants to merge 3 commits into great-expectations:develop from leodrivera:bugfix/column-values-unique-window-function-redshift-wlm
