fix(redshift): use boundary-aware segment stitching for query reconstruction by kyungsoo-datahub · Pull Request #16253 · datahub-project/datahub

kyungsoo-datahub · 2026-02-18T00:26:04Z

Redshift stores queries in fixed-width segments (200 chars for provisioned, 4000 for serverless). When using LISTAGG with per-segment RTRIM, tokens at boundaries merge without spaces (GROUP BY → GROUPBY).

Fix: Insert a space when trimmed segment length is less than segment size (indicates padding). Applied to all 6 LISTAGG expressions in both modes.

Also: Replace stl_query.querytxt (truncated to 4000 chars) with STL_QUERYTEXT CTE in provisioned scan-based lineage.

Added tests for boundary detection and segment size correctness.

…ruction Redshift stores queries in fixed-width character(200) or character(4000) segments. RTRIM per-segment and RTRIM(LISTAGG(text)) both fail because character(n) padding is stripped before LISTAGG receives values, merging keywords at boundaries (e.g. GROUP BY -> GROUPBY). Fix: add a space back when trimmed segment length < segment size. Applied to all 5 LISTAGG locations. Also replaced stl_query.querytxt (truncated to 4000 chars) with a CTE from STL_QUERYTEXT in provisioned scan lineage.

github-actions · 2026-02-18T00:26:15Z

Linear: ING-1654

codecov · 2026-02-18T00:28:35Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.


github-actions · 2026-03-02T23:08:33Z

Linear: ING-1802

rajatoss · 2026-03-02T23:58:37Z

Connector Tests Results

Connector tests failed for commit e27b075


Autogenerated by the connector-tests CI pipeline.

sgomezvillamor · 2026-03-03T13:56:33Z

My overall concern is that we’re adding quite a lot of complexity (and therefore risk of errors) to address what seems to be an anecdotal issue. How recurrent is this issue?

Beyond complexity, the proposed solution is not perfect.

The keyword-adjacency heuristic is fragile. We’d need a comprehensive reserved word list, and edge cases will grow quickly. For example:

-- GROUP as a keyword — needs space
SELECT a, COUNT(*) FROM t GROUP<CHR1>BY a

-- GROUP as an identifier — space is harmless but detection logic
-- must correctly classify it, which requires parsing context
SELECT group<CHR1>_id FROM t   -- identifier, no space needed

This would require bidirectional lookahead around every marker, which increases complexity even more.

The reserved word list itself is also a maintenance burden. Redshift’s reserved words may change across versions, and include non-obvious entries. Keeping that list correct and up to date is operational overhead.

Additionally, there’s the risk of false positives from string literals. CHR(1) is the SOH (Start of Heading) ASCII control character. It’s exotic, but Redshift allows it inside string literals. Very unlikely, but still technically possible.
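
To make these concerns concrete, here is a hypothetical Python sketch of a marker-based heuristic. It is not code from the PR: it assumes segments were joined with a CHR(1) marker, and `RESERVED`, `resolve_markers`, and the two rules (`need_both=True` or `False`) are invented purely for illustration.

```python
import re

MARKER = "\x01"  # CHR(1), assumed here to mark every segment boundary
RESERVED = {"SELECT", "FROM", "WHERE", "GROUP", "BY", "ORDER"}  # deliberately incomplete
_BOUNDARY = re.compile(r"(\w*)" + re.escape(MARKER) + r"(\w*)")


def resolve_markers(text: str, need_both: bool) -> str:
    """Replace each marker with a space or nothing, based on adjacent words."""
    def repl(match):
        left, right = match.group(1), match.group(2)
        left_kw = left.upper() in RESERVED
        right_kw = right.upper() in RESERVED
        add_space = (left_kw and right_kw) if need_both else (left_kw or right_kw)
        return left + (" " if add_space else "") + right
    return _BOUNDARY.sub(repl, text)


keyword = "SELECT a, COUNT(*) FROM t GROUP\x01BY a"
identifier = "SELECT group\x01_id FROM t"
mixed = "SELECT a\x01FROM t"
literal = "SELECT 'x\x01y'"

# Requiring keywords on both sides handles the first two cases but merges "aFROM".
print(resolve_markers(keyword, need_both=True))     # SELECT a, COUNT(*) FROM t GROUP BY a
print(resolve_markers(identifier, need_both=True))  # SELECT group_id FROM t
print(resolve_markers(mixed, need_both=True))       # SELECT aFROM t

# Requiring a keyword on either side fixes "a FROM" but breaks the identifier.
print(resolve_markers(mixed, need_both=False))       # SELECT a FROM t
print(resolve_markers(identifier, need_both=False))  # SELECT group _id FROM t

# A CHR(1) that was genuinely part of a string literal is silently dropped.
print(resolve_markers(literal, need_both=True) == literal)  # False
```

Neither rule is right for all three boundary cases, which is the bidirectional-lookahead problem in miniature, and both rules corrupt the literal.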

We currently read from STL_QUERYTEXT (or SYS_QUERY_TEXT in serverless).

There is also SVL_STATEMENTTEXT: https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html

That view is built on top of STL_QUERYTEXT. I’m not sure if it’s available in serverless. The docs describe how to reconstruct statements there.

Does SVL_STATEMENTTEXT suffer from the same reconstruction issues as STL_QUERYTEXT? If not, maybe we could switch to it.

If it’s not supported in serverless (given the 4K char limit there), is this issue still frequent enough that we must address it?

Another alternative would be audit logging to S3: https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html

From what I understand, the query statement is not split there. But this would be a completely different ingestion approach.

I agree this may not be feasible in the short term.

My current view:

The current proposal adds significant complexity and is still not a perfect solution.
Can we consider a simpler, more naive concat? Not perfect either, but much less complex.
And should we evaluate SVL_STATEMENTTEXT more seriously first?

IMO the final decision depends on how recurrent the issue is.

alwaysmeticulous · 2026-03-03T21:46:51Z

🔴 Meticulous spotted visual differences in 1 of 1373 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit d156355. This comment will update as new commits are pushed.

sgomezvillamor · 2026-03-04T08:20:12Z

+                        SELECT
+                            query,
+                            userid,
+                            RTRIM(LISTAGG(RTRIM(text) || CASE WHEN LEN(RTRIM(text)) < {_PROVISIONED_SEGMENT_SIZE} THEN ' ' ELSE '' END, '')


LGTM
Just wondering why the implementation in this PR is slightly different from the one suggested here:

https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html

select LISTAGG(CASE WHEN LEN(RTRIM(text)) = 0 THEN text ELSE RTRIM(text) END, '') within group (order by sequence) AS query_statement from SVL_STATEMENTTEXT where pid=pg_backend_pid();

AWS's approach doesn't fix the keyword merge problem. Their condition LEN(RTRIM(text)) = 0 only preserves segments that are entirely blank. A segment like "GROUP " becomes "GROUP" (length 5, not 0), which merges with the next "BY" segment -> "GROUPBY".

Our approach uses segment size as a boundary detector: if LEN(RTRIM(text)) < segment_size, the segment ended in whitespace that RTRIM removed (real spaces or padding), so we add a single space back to keep the adjacent tokens apart. A segment that is completely full gets no extra space, so a token split mid-word is rejoined unchanged.
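
The difference between the two expressions can be checked with a short Python sketch. It is an illustration under assumptions rather than the connector's code: `stitch_aws_doc` mimics the CASE expression from the AWS documentation, `stitch_pr` mimics the expression in this PR, and the 22-character segment width is chosen only to keep the example readable.

```python
def stitch_aws_doc(segments):
    # CASE WHEN LEN(RTRIM(text)) = 0 THEN text ELSE RTRIM(text) END
    return "".join(
        seg if len(seg.rstrip(" ")) == 0 else seg.rstrip(" ") for seg in segments
    )


def stitch_pr(segments, size):
    # RTRIM(text) || CASE WHEN LEN(RTRIM(text)) < size THEN ' ' ELSE '' END
    joined = "".join(
        seg.rstrip(" ") + (" " if len(seg.rstrip(" ")) < size else "")
        for seg in segments
    )
    return joined.rstrip(" ")  # outer RTRIM


first = "SELECT a FROM t GROUP "  # the trailing space is the segment's last character
size = len(first)
segments = [first, "BY a".ljust(size)]

print(stitch_aws_doc(segments))   # SELECT a FROM t GROUPBY a
print(stitch_pr(segments, size))  # SELECT a FROM t GROUP BY a
```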

sgomezvillamor · 2026-03-04T08:34:30Z

Basically, we are just doing string contains validations

So I would remove the focus on "stitching" and make this just a generic test for redshift queries

test_redshift_queries.py

class TestProvisionedQueries

class TestServerlessQueries

...

So we can add more validations in the future
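
A generic test module along these lines might look like the sketch below. It is only an outline under assumptions: `stitch_expr` is a hypothetical stand-in for the SQL the connector generates (the real tests would call the connector's own query builders), and the WITHIN GROUP clause is borrowed from the AWS documentation example rather than from the PR diff.

```python
_PROVISIONED_SEGMENT_SIZE = 200
_SERVERLESS_SEGMENT_SIZE = 4000


def stitch_expr(segment_size: int) -> str:
    """Hypothetical stand-in for the generated LISTAGG expression."""
    return (
        "RTRIM(LISTAGG(RTRIM(text) || "
        f"CASE WHEN LEN(RTRIM(text)) < {segment_size} THEN ' ' ELSE '' END, '') "
        "WITHIN GROUP (ORDER BY sequence))"
    )


class TestProvisionedQueries:
    def test_uses_provisioned_segment_size(self):
        sql = stitch_expr(_PROVISIONED_SEGMENT_SIZE)
        assert "LEN(RTRIM(text)) < 200" in sql
        # The AWS-documented condition must not be used.
        assert "LEN(RTRIM(text)) = 0" not in sql


class TestServerlessQueries:
    def test_uses_serverless_segment_size(self):
        sql = stitch_expr(_SERVERLESS_SEGMENT_SIZE)
        assert "LEN(RTRIM(text)) < 4000" in sql
        assert "LEN(RTRIM(text)) = 0" not in sql
```

Grouping by deployment mode rather than by feature leaves room for further string-contains checks in the same classes later.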

sgomezvillamor · 2026-03-04T08:36:33Z

+            assert "LEN(RTRIM(querytxt)) = 0" not in sql
+
+
+class TestBoundaryAwareStitchingLogic:


I don't see much value in these tests using _simulate_boundary_aware_stitching. We could just remove them.

If we really want to cover this in an integration test, we could add, e.g., a long enough SELECT in https://github.com/acryldata/connector-tests/blob/main/smoke-test/integration/create_data/redshift_data.py

sgomezvillamor

LGTM

- fix(ingest/teradata): set DATABASE context for view HELP commands (#16208)
- fix(redshift): use boundary-aware segment stitching for query reconstruction (#16253)
- fix(ingestion): update save button style (#16427)
- improvement(ui): design review changes for dataset summary and ingestion page (#16429)
- fix(ingestion/oracle): fix profiling crashes and silent table exclusions (#16396)
- docs(release): v0.3.16.5-acryl (#16428)
- fix(ui): Remove filter we don't support from run results tab (#16433)
- feat(agent-context): support sql search filters in mcp tools (#16403)

github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label Feb 18, 2026

github-actions Bot deployed to datahub-wheels (Preview) February 18, 2026 00:27 View deployment

datahub-cyborg Bot added the needs-review Label for PRs that need review from a maintainer. label Feb 18, 2026

vercel Bot deployed to Preview February 18, 2026 00:40 View deployment

kyungsoo-datahub marked this pull request as draft February 18, 2026 23:25

github-actions Bot deployed to datahub-wheels (Preview) February 19, 2026 00:53 View deployment

vercel Bot deployed to Preview February 19, 2026 01:06 View deployment

github-actions Bot deployed to datahub-wheels (Preview) February 19, 2026 16:07 View deployment

vercel Bot deployed to Preview February 19, 2026 16:20 View deployment

github-actions Bot deployed to datahub-wheels (Preview) February 20, 2026 21:47 View deployment

vercel Bot deployed to Preview February 20, 2026 22:01 View deployment

github-actions Bot deployed to datahub-wheels (Preview) February 20, 2026 23:17 View deployment

github-actions Bot deployed to datahub-wheels (Preview) February 20, 2026 23:29 View deployment

vercel Bot deployed to Preview February 20, 2026 23:42 View deployment

kyungsoo-datahub marked this pull request as ready for review March 2, 2026 23:08

github-actions Bot deployed to datahub-wheels (Preview) March 2, 2026 23:28 View deployment

vercel Bot deployed to Preview March 2, 2026 23:42 View deployment

maggiehays added pending-submitter-response Issue/request has been reviewed but requires a response from the submitter and removed needs-review Label for PRs that need review from a maintainer. labels Mar 3, 2026

github-actions Bot deployed to datahub-wheels (Preview) March 3, 2026 19:21 View deployment

vercel Bot deployed to Preview March 3, 2026 19:35 View deployment

maggiehays added needs-review Label for PRs that need review from a maintainer. and removed pending-submitter-response Issue/request has been reviewed but requires a response from the submitter labels Mar 3, 2026

kyungsoo-datahub force-pushed the fix/redshift-query-stitching branch from fb62897 to 347a23b Compare March 3, 2026 21:26

github-actions Bot deployed to datahub-wheels (Preview) March 3, 2026 21:28 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) March 3, 2026 21:30 View deployment

fix(redshift): add TODO for segment boundary limitation and edge case…

c1dcc44

… tests

github-actions Bot deployed to datahub-wheels (Preview) March 3, 2026 21:39 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) March 3, 2026 21:42 View deployment

refactor(redshift): use segment size constants in SQL

d156355

Extract _PROVISIONED_SEGMENT_SIZE (200) and _SERVERLESS_SEGMENT_SIZE (4000) constants, replace hardcoded values in LISTAGG SQL with placeholders.

github-actions Bot deployed to datahub-wheels (Preview) March 3, 2026 21:49 View deployment

github-actions Bot deployed to datahub-project-web-react (Preview) March 3, 2026 21:52 View deployment

Merge branch 'master' into fix/redshift-query-stitching

1a1d35e

github-actions Bot deployed to datahub-wheels (Preview) March 3, 2026 21:59 View deployment

vercel Bot deployed to Preview March 3, 2026 22:14 View deployment

sgomezvillamor reviewed Mar 4, 2026

View reviewed changes

sgomezvillamor approved these changes Mar 4, 2026

View reviewed changes

maggiehays added pending-submitter-merge and removed needs-review Label for PRs that need review from a maintainer. labels Mar 4, 2026

kyungsoo-datahub added 3 commits March 4, 2026 06:39

refactor(redshift): rename test file and classes to be generic

1b5c3b0

test(redshift): remove simulation tests, keep pattern validation tests

1ffff9a

Merge branch 'master' into fix/redshift-query-stitching

e27b075

github-actions Bot deployed to datahub-wheels (Preview) March 4, 2026 14:42 View deployment

vercel Bot deployed to Preview March 4, 2026 14:56 View deployment

kyungsoo-datahub merged commit ab93f48 into master Mar 4, 2026
60 of 61 checks passed

kyungsoo-datahub deleted the fix/redshift-query-stitching branch March 4, 2026 15:53

4 participants
