fix(redshift): use boundary-aware segment stitching for query reconstruction#16253
Conversation
…ruction Redshift stores queries in fixed-width character(200) or character(4000) segments. RTRIM per-segment and RTRIM(LISTAGG(text)) both fail because character(n) padding is stripped before LISTAGG receives values, merging keywords at boundaries (e.g. GROUP BY -> GROUPBY). Fix: add a space back when trimmed segment length < segment size. Applied to all 5 LISTAGG locations. Also replaced stl_query.querytxt (truncated to 4000 chars) with a CTE from STL_QUERYTEXT in provisioned scan lineage.
|
Linear: ING-1654 |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Linear: ING-1802 |
Connector Tests ResultsConnector tests failed for commit Autogenerated by the connector-tests CI pipeline. |
|
My overall concern is that we’re adding quite a lot of complexity (and therefore risk of errors) to address what seems to be an anecdotal issue. How recurrent is this issue? Beyond complexity, proposed solution is not perfect. The keyword-adjacency heuristic is fragile. We’d need a comprehensive reserved word list, and edge cases will grow quickly. For example: -- GROUP as a keyword — needs space
SELECT a, COUNT(*) FROM t GROUP<CHR1>BY a
-- GROUP as an identifier — space is harmless but detection logic
-- must correctly classify it, which requires parsing context
SELECT group<CHR1>_id FROM t -- identifier, no space neededThis would require bidirectional lookahead around every marker, which increases complexity even more. The reserved word list itself is also a maintenance burden. Redshift’s reserved words may change across versions, and include non-obvious entries. Keeping that list correct and up to date is operational overhead. Additionally, there’s the risk of false positives from string literals. We currently read from There is also That view is built on top of Does If it’s not supported in serverless (given the 4K char limit there), is this issue still frequent enough that we must address it? Another alternative would be audit logging to S3: https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html From what I understand, the query statement is not split there. But this would be a completely different ingestion approach. I agree this may not feasible in the short term. My current view:
IMO final decision depends on how recurrent is the issue. |
fb62897 to
347a23b
Compare
|
🔴 Meticulous spotted visual differences in 1 of 1373 screens tested: view and approve differences detected. Meticulous evaluated ~8 hours of user flows against your PR. Last updated for commit d156355. This comment will update as new commits are pushed. |
Extract _PROVISIONED_SEGMENT_SIZE (200) and _SERVERLESS_SEGMENT_SIZE (4000) constants, replace hardcoded values in LISTAGG SQL with placeholders.
| SELECT | ||
| query, | ||
| userid, | ||
| RTRIM(LISTAGG(RTRIM(text) || CASE WHEN LEN(RTRIM(text)) < {_PROVISIONED_SEGMENT_SIZE} THEN ' ' ELSE '' END, '') |
There was a problem hiding this comment.
LGTM
Just wondering how is the implementation in PR is slightly different to the one suggested here
https://docs.aws.amazon.com/redshift/latest/dg/r_SVL_STATEMENTTEXT.html
select LISTAGG(CASE WHEN LEN(RTRIM(text)) = 0 THEN text ELSE RTRIM(text) END, '') within group (order by sequence) AS query_statement
from SVL_STATEMENTTEXT where pid=pg_backend_pid();
There was a problem hiding this comment.
AWS's approach doesn't fix the keyword merge problem. Their condition LEN(RTRIM(text)) = 0 only preserves completely empty segments. A segment like "GROUP " becomes "GROUP" (length 5, not 0), which merges with the next "BY" segment -> "GROUPBY"
Our approach uses segment size as a boundary detector: if LEN(RTRIM(text)) < segment_size, we add a space back because the segment likely reached a natural boundary (not just had trailing spaces stripped).
There was a problem hiding this comment.
Basically, we are just doing string contains validations
So I would remove the focus on "stitching" and make this just a generic test for redshift queries
test_redshift_queries.pyclass TestProvisionedQueriesclass TestServerlessQueries- ...
So we can add more validations in the future
| assert "LEN(RTRIM(querytxt)) = 0" not in sql | ||
|
|
||
|
|
||
| class TestBoundaryAwareStitchingLogic: |
There was a problem hiding this comment.
I don't see much value on these tests using _simulate_boundary_aware_stitching. We could just remove them.
If we really want to test this in an integration test, we could add a eg long enough select in https://github.com/acryldata/connector-tests/blob/main/smoke-test/integration/create_data/redshift_data.py
- fix(ingest/teradata): set DATABASE context for view HELP commands (#16208) - fix(redshift): use boundary-aware segment stitching for query reconstruction (#16253) - fix(ingestion): update save button style (#16427) - improvement(ui): design review changes for dataset summary and ingestion page (#16429) - fix(ingestion/oracle): fix profiling crashes and silent table exclusions (#16396) - docs(release): v0.3.16.5-acryl (#16428) - fix(ui): Remove filter we don't support from run results tab (#16433) - feat(agent-context): support sql search filters in mcp tools (#16403)
Redshift stores queries in fixed-width segments (200 chars for provisioned, 4000 for serverless). When using LISTAGG with per-segment RTRIM, tokens at boundaries merge without spaces (GROUP BY → GROUPBY).
Fix: Insert a space when trimmed segment length is less than segment size (indicates padding). Applied to all 6 LISTAGG expressions in both modes.
Also: Replace stl_query.querytxt (truncated to 4000 chars) with STL_QUERYTEXT CTE in provisioned scan-based lineage.
Added tests for boundary detection and segment size correctness.