fix(redshift): use boundary-aware segment stitching for query reconstruction #16253

Merged
kyungsoo-datahub merged 7 commits into master from fix/redshift-query-stitching
Mar 4, 2026

Conversation

@kyungsoo-datahub

@kyungsoo-datahub kyungsoo-datahub commented Feb 18, 2026

Contributor

Redshift stores queries in fixed-width segments (200 chars for provisioned, 4000 for serverless). When using LISTAGG with per-segment RTRIM, tokens at boundaries merge without spaces (GROUP BY → GROUPBY).

Fix: Insert a space when trimmed segment length is less than segment size (indicates padding). Applied to all 6 LISTAGG expressions in both modes.

Also: Replace stl_query.querytxt (truncated to 4000 chars) with STL_QUERYTEXT CTE in provisioned scan-based lineage.

Added tests for boundary detection and segment size correctness.
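
To make the failure mode concrete, here is a minimal Python simulation of the stitching rule described above. It is a sketch, not the connector's actual SQL: the helper names (`split_into_segments`, `stitch_naive`, `stitch_boundary_aware`) are hypothetical, and the tiny segment size of 11 is chosen only so that a boundary lands right after `GROUP`. Real segments are 200 or 4000 characters wide.

```python
from typing import List


def split_into_segments(query: str, size: int) -> List[str]:
    """Mimic character(n) storage: fixed-width chunks, space-padded."""
    return [query[i : i + size].ljust(size) for i in range(0, len(query), size)]


def stitch_naive(segments: List[str]) -> str:
    """Per-segment RTRIM, then concatenate (the behaviour being fixed)."""
    return "".join(seg.rstrip(" ") for seg in segments)


def stitch_boundary_aware(segments: List[str], size: int) -> str:
    """Add one space back whenever RTRIM shortened the segment."""
    parts = []
    for seg in segments:
        trimmed = seg.rstrip(" ")
        parts.append(trimmed + (" " if len(trimmed) < size else ""))
    # The outer RTRIM drops the space appended after the final segment.
    return "".join(parts).rstrip(" ")


query = "SELECT a FROM t GROUP BY a"
segments = split_into_segments(query, size=11)
# segments[1] is "OM t GROUP ", so its trailing space is a real separator.
print(stitch_naive(segments))               # SELECT a FROM t GROUPBY a
print(stitch_boundary_aware(segments, 11))  # SELECT a FROM t GROUP BY a
```

In this sketch a run of several trailing spaces at a boundary collapses to a single space.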

…ruction

Redshift stores queries in fixed-width character(200) or character(4000)
segments. RTRIM per-segment and RTRIM(LISTAGG(text)) both fail because
character(n) padding is stripped before LISTAGG receives values, merging
keywords at boundaries (e.g. GROUP BY -> GROUPBY).

Fix: add a space back when trimmed segment length < segment size. Applied
to all 5 LISTAGG locations. Also replaced stl_query.querytxt (truncated
to 4000 chars) with a CTE from STL_QUERYTEXT in provisioned scan lineage.
@github-actions github-actions Bot added the ingestion (PR or Issue related to the ingestion of metadata) label Feb 18, 2026
@github-actions

Contributor

Linear: ING-1654

@codecov

codecov Bot commented Feb 18, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

@github-actions

github-actions Bot commented Mar 2, 2026

Contributor

Linear: ING-1802

@rajatoss

rajatoss commented Mar 2, 2026

Member

Connector Tests Results

Connector tests failed for commit e27b075

Autogenerated by the connector-tests CI pipeline.

@maggiehays maggiehays added pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) and removed needs-review (Label for PRs that need review from a maintainer) labels Mar 3, 2026
@sgomezvillamor

Contributor

My overall concern is that we’re adding quite a lot of complexity (and therefore risk of errors) to address what seems to be an anecdotal issue. How recurrent is this issue?

Beyond complexity, the proposed solution is not perfect.

The keyword-adjacency heuristic is fragile. We’d need a comprehensive reserved word list, and edge cases will grow quickly. For example:

-- GROUP as a keyword — needs space
SELECT a, COUNT(*) FROM t GROUP<CHR1>BY a

-- GROUP as an identifier — space is harmless but detection logic
-- must correctly classify it, which requires parsing context
SELECT group<CHR1>_id FROM t   -- identifier, no space needed

This would require bidirectional lookahead around every marker, which increases complexity even more.
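
The classification problem can be illustrated with a deliberately naive, hypothetical version of such a heuristic. Nothing below comes from the PR: `MARKER`, `RESERVED`, and `resolve_markers` are made up for illustration, and the two-word reserved list is intentionally incomplete.

```python
import re

MARKER = "\x01"                # CHR(1), standing in for a segment boundary
RESERVED = {"GROUP", "ORDER"}  # toy list; a real one would be far longer


def resolve_markers(sql: str) -> str:
    """Insert a space at a marker only if the word before it is reserved."""

    def replace(match):
        word = match.group(1)
        return word + (" " if word.upper() in RESERVED else "")

    return re.sub(r"(\w*)" + MARKER, replace, sql)


# Keyword case: the space is restored correctly.
print(resolve_markers("SELECT a, COUNT(*) FROM t GROUP\x01BY a"))
# Identifier case: looking only backwards splits group_id in two.
print(resolve_markers("SELECT group\x01_id FROM t"))
```

The second call returns `SELECT group _id FROM t`, so a backward-only check is not enough; the heuristic would also have to inspect what follows the marker.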

The reserved word list itself is also a maintenance burden. Redshift’s reserved words may change across versions, and include non-obvious entries. Keeping that list correct and up to date is operational overhead.

Additionally, there’s the risk of false positives from string literals. CHR(1) is the SOH (Start of Heading) ASCII control character. It’s exotic, but Redshift allows it inside string literals. Very unlikely, but still technically possible.


We currently read from STL_QUERYTEXT (or SYS_QUERY_TEXT in serverless).

There is also SVL_STATEMENTTEXT: https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html

That view is built on top of STL_QUERYTEXT. I’m not sure if it’s available in serverless. The docs describe how to reconstruct statements there.

Does SVL_STATEMENTTEXT suffer from the same reconstruction issues as STL_QUERYTEXT? If not, maybe we could switch to it.

If it’s not supported in serverless (given the 4K char limit there), is this issue still frequent enough that we must address it?


Another alternative would be audit logging to S3: https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html

From what I understand, the query statement is not split there. But this would be a completely different ingestion approach.

I agree this may not be feasible in the short term.


My current view:

  • The current proposal adds significant complexity and is still not a perfect solution.
  • Can we consider a simpler, more naive concat? Not perfect either, but much less complex.
  • And should we evaluate SVL_STATEMENTTEXT more seriously first?

IMO the final decision depends on how recurrent the issue is.

@maggiehays maggiehays added needs-review (Label for PRs that need review from a maintainer) and removed pending-submitter-response (Issue/request has been reviewed but requires a response from the submitter) labels Mar 3, 2026
@kyungsoo-datahub kyungsoo-datahub force-pushed the fix/redshift-query-stitching branch from fb62897 to 347a23b Compare March 3, 2026 21:26
@alwaysmeticulous

alwaysmeticulous Bot commented Mar 3, 2026


🔴 Meticulous spotted visual differences in 1 of 1373 screens tested: view and approve differences detected.

Meticulous evaluated ~8 hours of user flows against your PR.

Last updated for commit d156355. This comment will update as new commits are pushed.

Extract _PROVISIONED_SEGMENT_SIZE (200) and _SERVERLESS_SEGMENT_SIZE (4000)
constants, replace hardcoded values in LISTAGG SQL with placeholders.
SELECT
query,
userid,
RTRIM(LISTAGG(RTRIM(text) || CASE WHEN LEN(RTRIM(text)) < {_PROVISIONED_SEGMENT_SIZE} THEN ' ' ELSE '' END, '')

Contributor

LGTM
Just wondering why the implementation in this PR is slightly different from the one suggested here:

https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html

select LISTAGG(CASE WHEN LEN(RTRIM(text)) = 0 THEN text ELSE RTRIM(text) END, '') within group (order by sequence) AS query_statement 
from SVL_STATEMENTTEXT where pid=pg_backend_pid();

Contributor Author

AWS's approach doesn't fix the keyword merge problem. Their condition LEN(RTRIM(text)) = 0 only preserves completely empty segments. A segment like "GROUP " becomes "GROUP" (length 5, not 0), which merges with the next "BY" segment -> "GROUPBY".

Our approach uses the segment size as a boundary detector: if LEN(RTRIM(text)) < segment_size, the segment ended in whitespace (padding or a real trailing space) that RTRIM removed, so we add a single space back. A segment that fills the full width had nothing stripped and is concatenated as is.
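
The difference between the two conditions can be checked with a small Python model of each CASE expression. This is an illustrative sketch only: `aws_piece` and `pr_piece` are hypothetical names, and a width of 6 stands in for the real 200 or 4000.

```python
SEGMENT_SIZE = 6  # toy width; real segments are 200 or 4000 characters


def aws_piece(text: str) -> str:
    """CASE WHEN LEN(RTRIM(text)) = 0 THEN text ELSE RTRIM(text) END"""
    trimmed = text.rstrip(" ")
    return text if len(trimmed) == 0 else trimmed


def pr_piece(text: str, size: int = SEGMENT_SIZE) -> str:
    """RTRIM(text) || CASE WHEN LEN(RTRIM(text)) < size THEN ' ' ELSE '' END"""
    trimmed = text.rstrip(" ")
    return trimmed + (" " if len(trimmed) < size else "")


segments = ["GROUP ", "BY a  "]
print("".join(aws_piece(s) for s in segments))             # GROUPBY a
print("".join(pr_piece(s) for s in segments).rstrip(" "))  # GROUP BY a
```

The two rules also differ on an all-blank segment: the AWS expression keeps it verbatim, while the segment-size rule reduces it to a single space.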

Contributor

Basically, we are just doing string contains validations

So I would remove the focus on "stitching" and make this just a generic test for redshift queries

  • test_redshift_queries.py
  • class TestProvisionedQueries
  • class TestServerlessQueries
  • ...

So we can add more validations in the future

Contributor Author

Revised.

assert "LEN(RTRIM(querytxt)) = 0" not in sql


class TestBoundaryAwareStitchingLogic:

Contributor

I don't see much value in these tests using _simulate_boundary_aware_stitching. We could just remove them.

If we really want to test this in an integration test, we could add, e.g., a long enough SELECT in https://github.com/acryldata/connector-tests/blob/main/smoke-test/integration/create_data/redshift_data.py

Contributor Author

Removed.

@sgomezvillamor sgomezvillamor left a comment

Contributor

LGTM

@maggiehays maggiehays added pending-submitter-merge and removed needs-review (Label for PRs that need review from a maintainer) labels Mar 4, 2026
@kyungsoo-datahub kyungsoo-datahub merged commit ab93f48 into master Mar 4, 2026
60 of 61 checks passed
@kyungsoo-datahub kyungsoo-datahub deleted the fix/redshift-query-stitching branch March 4, 2026 15:53
david-leifker pushed a commit that referenced this pull request May 27, 2026
- fix(ingest/teradata): set DATABASE context for view HELP commands (#16208)
- fix(redshift): use boundary-aware segment stitching for query reconstruction (#16253)
- fix(ingestion): update save button style (#16427)
- improvement(ui): design review changes for dataset summary and ingestion page (#16429)
- fix(ingestion/oracle): fix profiling crashes and silent table exclusions (#16396)
- docs(release): v0.3.16.5-acryl (#16428)
- fix(ui): Remove filter we don't support from run results tab (#16433)
- feat(agent-context): support sql search filters in mcp tools (#16403)

Labels

ingestion (PR or Issue related to the ingestion of metadata), pending-submitter-merge

4 participants