Skip to content

Commit 663413b

Browse files
Fix DuckDB identifier and struct value quoting for special characters (#297)
## Summary Harden the pg_lake → pgduck_server SQL emission paths against identifier quoting and struct field-name escaping mismatches between PostgreSQL and DuckDB. ### Identifier quoting - Add `duckdb_quote_identifier()` that quotes identifiers reserved in DuckDB but not in PostgreSQL — both RESERVED_KEYWORD (`pivot`, `qualify`, `lambda`, `summarize`, from DuckDB's vendored `kwlist.hpp` via `tools/generate_duckdb_kwlist.py`. CI now runs `make check-duckdb-kwlist` to catch a stale table after a DuckDB bump; the script falls back to fetching the file from `raw.githubusercontent.com` at the pinned submodule SHA so CI doesn't need the submodule initialised. - Route every relation / schema / column / function / operator-namespace name through `duckdb_quote_identifier()` in both the ruleutils deparse path (`deparse_ruleutils.c`) and the postgres_fdw-style deparser (`deparse.c`). Fixes `pivot.t` style schema-qualified names appearing unquoted in `EXPLAIN (VERBOSE)` output. - `deparse_ruleutils.c` uses `set_config_option("quote_all_identifiers", "on", …, GUC_ACTION_SAVE, …)` so `AtEOXact_GUC` unwinds the GUC stack even if `pg_get_querydef` throws. ### STRUCT field-name handling - `QuoteDuckDBFieldName` emits SQL-standard `""` doubling (what DuckDB and SQL both expect) instead of C-style `\"` backslash escaping. - `ParseDuckDBFieldName` rewritten to consume SQL-standard `""` doubling; the paren-depth scanner in `ParseDuckDBFieldType` skips over quoted regions so embedded `(` / `)` in field names don't break struct parsing. - `StructOutForPGDuck` uses DuckDB-compatible backslash escaping for struct literal **values** travelling through CSV, because that context needs to round-trip through the outer CSV layer. - Two helpers exposed via `struct_conversion.h`: - `QuoteDuckDBStructKey` (backslash escaping) for struct literals emitted as CSV values — the `struct_conversion.c:StructOutForPGDuck` path. - `QuoteDuckDBStructKeySQL` (SQL-standard `''` doubling) for struct literals concatenated directly into SQL query text — the `read_data.c:TupleDescToStructProjection` path, including the INTERVAL / struct-of-INTERVAL / INTERVAL[] emit branches. Previously used bare `'%s':` which truncated on any field name containing `'`. ### Composite-type catalog lookup - `GetOrCreatePGStructType` previously `psprintf`'d composite field names directly into a `'{...}'::name[]` array literal, producing "malformed array literal" for field names containing `,`, `:`, quotes, spaces, parens, or backslashes during auto-detect. Switched to `SPI_execute_with_args` with the names and oids passed as proper `name[]` / `oid[]` Datums constructed via `construct_array_builtin`. ### CI / infrastructure - Artifact name for `postgres-logs` upload now includes `worker_id` (when the matrix supplies one) and `from_pg_version` (for the upgrade matrix), so parallel matrix jobs stop colliding on the same artifact name. - Lint workflow runs `make check-duckdb-kwlist`. ## Test plan ### Reserved-keyword column names - `test_{read,read_csv,read_json,writable_parquet,iceberg_table}_with_reserved_keyword_columns` - `test_each_reserved_keyword_{parquet,csv,json}[kw]` — parameterized across every DuckDB-only reserved keyword; covers SELECT + WHERE pushdown + EXPLAIN-quoted verification - `test_iceberg_table_with_reserved_keyword_columns`, `test_iceberg_table_insert_select_all_reserved_keywords`, `test_readable_iceberg_foreign_table_with_reserved_keyword_columns` - `test_copy_roundtrip_{parquet,csv,json}[kw]` (in `pg_lake_copy`) ### Typed struct fields with reserved-keyword names - `test_reserved_keyword_struct_field_typed[interval|timestamp|interval[]]` — INTERVAL / TIMESTAMP / INTERVAL[] emit branches in `read_data.c` - `test_reserved_keyword_geometry_column` — iceberg FDW with a `pivot geometry` column; SELECT + filter pushdown - `test_reserved_keyword_geometry_roundtrip[parquet|csv|json]` — parameterized round-trip exercising `ST_AsWKB` / `ST_AsGeoJSON` / `ST_GeomFromWKB` / `ST_GeomFromText` / `ST_GeomFromGeoJSON` with reserved column names - `test_iceberg_map_of_interval_with_reserved_column` — MAP-of-INTERVAL column named `pivot` - `test_iceberg_timetz_to_time_cast_on_reserved_column` — `write_data.c` TIMETZ→TIME CAST branch - `test_iceberg_overflow_conversion_on_reserved_column` — type-widening CAST projection path - `test_s3log_strptime_with_reserved_column_name` — LOG-format foreign table hitting the `strptime()` wrapper branch ### STRUCT field-name edge cases - `test_iceberg_composite_field_with_special_characters` — spaces, `"`, `'`, `\` in field names - `test_iceberg_composite_field_with_embedded_double_quote` — `U&"has\0022quote"` - `test_struct_of_interval_field_name_with_single_quote` — struct-of-INTERVAL with a `'`-bearing field name - `test_autodetect_struct_field_names_with_hostile_characters` — `CREATE FOREIGN TABLE () SERVER pg_lake OPTIONS (path …)` against a parquet file whose struct fields have commas, colons, quotes, spaces, parens, backslashes (covers the SPI parameter-binding fix) - `test_copy_roundtrip_composite_with_embedded_quote`, `test_copy_roundtrip_csv_composite_with_embedded_quote` — parquet/CSV round-trips ### Other pg_lake_engine paths - `test_iceberg_analyze_with_reserved_columns` — ANALYZE on iceberg FDW with reserved-keyword columns - `test_reserved_keyword_schema_with_custom_function_and_operator` — schema-qualified function/operator deparse with a reserved schema name
1 parent bbd1176 commit 663413b

59 files changed

Lines changed: 5020 additions & 988 deletions

Some content is hidden

Large Commits have some content hidden by default. Use the searchbox below for content that may be hidden.

.github/actions/run-test/action.yml

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -140,7 +140,7 @@ runs:
140140
uses: actions/upload-artifact@v4
141141
if: always()
142142
with:
143-
name: postgres-logs-${{ inputs.pg_version }}-${{ inputs.test_make_target }}
143+
name: postgres-logs-${{ inputs.pg_version }}-${{ inputs.test_make_target }}${{ inputs.from_pg_version != '' && format('-from{0}', inputs.from_pg_version) || '' }}${{ inputs.worker_id != '' && format('-{0}', inputs.worker_id) || '' }}
144144
path: |
145145
/tmp/pg_lake_tests/logfile
146146
/tmp/pg_installcheck_tests/logfile

.github/workflows/lint.yml

Lines changed: 4 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -53,3 +53,7 @@ jobs:
5353
- name: Check formatting
5454
run: |
5555
make check-indent
56+
57+
- name: Check DuckDB keyword table is in sync
58+
run: |
59+
make check-duckdb-kwlist

Makefile

Lines changed: 12 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ CUSTOM_TARGETS = check-pg_lake_engine installcheck-pg_lake_engine check-pg_exten
1414
DUCKDB_BUILD_USE_CACHE ?= 0
1515

1616
# other phony targets go here
17-
.PHONY: all fast install install-fast installcheck clean check submodules uninstall check-indent reindent installcheck-postgres installcheck-postgres-with_extensions_created
17+
.PHONY: all fast install install-fast installcheck clean check submodules uninstall check-indent reindent installcheck-postgres installcheck-postgres-with_extensions_created generate-duckdb-kwlist check-duckdb-kwlist
1818
.PHONY: $(ALL_TARGETS)
1919
.PHONY: $(PHONY_TARGETS)
2020

@@ -222,6 +222,17 @@ uninstall-avro:
222222
rm -f $(PG_LIBDIR)/libavro.*
223223
rm -rf $(PG_INCLUDEDIR)/avro*
224224

225+
## DuckDB keyword list maintenance
226+
# Regenerate the checked-in keyword table from the vendored DuckDB kwlist.hpp.
227+
# Re-run whenever duckdb_pglake/duckdb is updated to a new DuckDB release.
228+
generate-duckdb-kwlist:
229+
python3 tools/generate_duckdb_kwlist.py
230+
231+
# Verify that the checked-in keyword table matches the current kwlist.hpp.
232+
# Run in CI to catch stale keyword tables after a DuckDB version bump.
233+
check-duckdb-kwlist:
234+
python3 tools/generate_duckdb_kwlist.py --check
235+
225236
## Other targets
226237
check-isolation_pg_lake_table:
227238
$(MAKE) -C pg_lake_table check-isolation
Lines changed: 339 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,339 @@
1+
"""
2+
Tests for correct identifier quoting during COPY TO / COPY FROM when table
3+
columns are named after DuckDB reserved keywords.
4+
5+
DuckDB reserves several keywords (PIVOT, QUALIFY, LAMBDA, etc.) that
6+
PostgreSQL does not. These are legal unquoted column names in PostgreSQL but
7+
require quoting in the SQL that pg_lake generates for pgduck_server.
8+
9+
See: https://github.com/Snowflake-Labs/pg_lake/issues/277
10+
"""
11+
12+
import pytest
13+
from utils_pytest import *
14+
15+
# DuckDB-only reserved keywords that are safe as unquoted PostgreSQL
16+
# identifiers.
17+
DUCKDB_ONLY_RESERVED = [
18+
"describe",
19+
"lambda",
20+
"pivot",
21+
"pivot_longer",
22+
"pivot_wider",
23+
"qualify",
24+
"show",
25+
"summarize",
26+
"unpivot",
27+
]
28+
29+
# Smaller set used in multi-column tests for conciseness.
30+
TEST_KEYWORDS = ["pivot", "qualify", "lambda", "show"]
31+
32+
33+
def test_copy_to_parquet_with_reserved_keyword_columns(pg_conn, s3):
34+
"""
35+
COPY ... TO parquet must succeed when the table has DuckDB-reserved
36+
keyword column names, and COPY ... FROM must read the file back correctly.
37+
"""
38+
url = f"s3://{TEST_BUCKET}/test_kw_copy_parquet/kw_cols.parquet"
39+
40+
try:
41+
run_command(
42+
"""
43+
CREATE TABLE test_kw_copy_src (
44+
pivot int,
45+
qualify int,
46+
lambda int,
47+
show int
48+
);
49+
INSERT INTO test_kw_copy_src VALUES (1, 2, 3, 4), (10, 20, 30, 40);
50+
""",
51+
pg_conn,
52+
)
53+
54+
run_command(
55+
f"COPY test_kw_copy_src TO '{url}' WITH (format 'parquet')",
56+
pg_conn,
57+
)
58+
59+
run_command(
60+
f"""
61+
CREATE TABLE test_kw_copy_dst (
62+
pivot int,
63+
qualify int,
64+
lambda int,
65+
show int
66+
);
67+
COPY test_kw_copy_dst FROM '{url}' WITH (format 'parquet');
68+
""",
69+
pg_conn,
70+
)
71+
72+
result = run_query(
73+
"SELECT pivot, qualify, lambda, show FROM test_kw_copy_dst ORDER BY pivot",
74+
pg_conn,
75+
)
76+
assert result == [[1, 2, 3, 4], [10, 20, 30, 40]]
77+
finally:
78+
pg_conn.rollback()
79+
80+
81+
def test_copy_to_csv_with_reserved_keyword_columns(pg_conn, s3):
82+
"""
83+
COPY ... TO CSV and back must work with DuckDB-reserved keyword columns.
84+
"""
85+
url = f"s3://{TEST_BUCKET}/test_kw_copy_csv/kw_cols.csv"
86+
87+
try:
88+
run_command(
89+
f"""
90+
CREATE TABLE test_kw_csv_src (pivot int, qualify int, lambda int, show int);
91+
INSERT INTO test_kw_csv_src VALUES (3, 6, 9, 12), (30, 60, 90, 120);
92+
COPY test_kw_csv_src TO '{url}' WITH (format 'csv', header on);
93+
94+
CREATE TABLE test_kw_csv_dst (pivot int, qualify int, lambda int, show int);
95+
COPY test_kw_csv_dst FROM '{url}' WITH (format 'csv', header on);
96+
""",
97+
pg_conn,
98+
)
99+
100+
result = run_query(
101+
"SELECT pivot, qualify, lambda, show FROM test_kw_csv_dst ORDER BY pivot",
102+
pg_conn,
103+
)
104+
assert result == [[3, 6, 9, 12], [30, 60, 90, 120]]
105+
finally:
106+
pg_conn.rollback()
107+
108+
109+
def test_copy_to_json_with_reserved_keyword_columns(pg_conn, s3):
110+
"""
111+
COPY ... TO JSON and back must work with DuckDB-reserved keyword columns.
112+
"""
113+
url = f"s3://{TEST_BUCKET}/test_kw_copy_json/kw_cols.json"
114+
115+
try:
116+
run_command(
117+
f"""
118+
CREATE TABLE test_kw_json_src (pivot int, qualify int, lambda int, show int);
119+
INSERT INTO test_kw_json_src VALUES (7, 14, 21, 28), (70, 140, 210, 280);
120+
COPY test_kw_json_src TO '{url}' WITH (format 'json');
121+
122+
CREATE TABLE test_kw_json_dst (pivot int, qualify int, lambda int, show int);
123+
COPY test_kw_json_dst FROM '{url}' WITH (format 'json');
124+
""",
125+
pg_conn,
126+
)
127+
128+
result = run_query(
129+
"SELECT pivot, qualify, lambda, show FROM test_kw_json_dst ORDER BY pivot",
130+
pg_conn,
131+
)
132+
assert result == [[7, 14, 21, 28], [70, 140, 210, 280]]
133+
finally:
134+
pg_conn.rollback()
135+
136+
137+
# ---------------------------------------------------------------------------
138+
# Composite type field names with embedded double-quotes
139+
# ---------------------------------------------------------------------------
140+
141+
142+
def test_copy_roundtrip_composite_with_embedded_quote(pg_conn, s3):
143+
"""
144+
COPY TO/FROM must handle composite types whose field names contain
145+
double-quote characters. The STRUCT type definition sent to DuckDB must
146+
properly escape the embedded quotes.
147+
148+
Reproduces: https://github.com/snowflake-eng/sfpg-extension-pg_lake_replication/issues/361
149+
"""
150+
url = f"s3://{TEST_BUCKET}/test_kw_copy_composite_quote/data.parquet"
151+
152+
try:
153+
run_command(
154+
"""
155+
CREATE TYPE test_kw_composite_quote AS (
156+
U&"has\\0022quote" text,
157+
normal int,
158+
"has'single" int,
159+
U&"has\\005Cback" int
160+
)
161+
""",
162+
pg_conn,
163+
)
164+
165+
run_command(
166+
"""
167+
CREATE TABLE test_kw_cq_src (id int, s test_kw_composite_quote);
168+
INSERT INTO test_kw_cq_src VALUES (1, ROW('hello', 42, 7, 8));
169+
""",
170+
pg_conn,
171+
)
172+
173+
run_command(
174+
f"COPY test_kw_cq_src TO '{url}' WITH (format 'parquet')",
175+
pg_conn,
176+
)
177+
178+
run_command(
179+
f"""
180+
CREATE TABLE test_kw_cq_dst (id int, s test_kw_composite_quote);
181+
COPY test_kw_cq_dst FROM '{url}' WITH (format 'parquet');
182+
""",
183+
pg_conn,
184+
)
185+
186+
result = run_query(
187+
"SELECT id, (s).normal FROM test_kw_cq_dst",
188+
pg_conn,
189+
)
190+
assert result == [[1, 42]]
191+
192+
result = run_query(
193+
'SELECT (s).U&"has\\0022quote" FROM test_kw_cq_dst',
194+
pg_conn,
195+
)
196+
assert result == [["hello"]]
197+
198+
result = run_query(
199+
"""SELECT (s)."has'single", (s).U&"has\\005Cback" FROM test_kw_cq_dst""",
200+
pg_conn,
201+
)
202+
assert result == [[7, 8]]
203+
finally:
204+
pg_conn.rollback()
205+
206+
207+
def test_copy_roundtrip_csv_composite_with_embedded_quote(pg_conn, s3):
208+
"""
209+
COPY TO/FROM CSV with composite types containing double-quote field names.
210+
This exercises the read_csv columns= type string path.
211+
212+
Reproduces: https://github.com/snowflake-eng/sfpg-extension-pg_lake_replication/issues/361
213+
"""
214+
url = f"s3://{TEST_BUCKET}/test_kw_copy_csv_composite_quote/data.csv"
215+
216+
try:
217+
run_command(
218+
"""
219+
CREATE TYPE test_kw_csv_cq AS (
220+
U&"has\\0022quote" text,
221+
normal int,
222+
"has'single" int,
223+
U&"has\\005Cback" int
224+
)
225+
""",
226+
pg_conn,
227+
)
228+
229+
run_command(
230+
"""
231+
CREATE TABLE test_kw_csv_cq_src (id int, s test_kw_csv_cq);
232+
INSERT INTO test_kw_csv_cq_src VALUES (1, ROW('world', 99, 11, 12));
233+
""",
234+
pg_conn,
235+
)
236+
237+
run_command(
238+
f"COPY test_kw_csv_cq_src TO '{url}' WITH (format 'csv', header on)",
239+
pg_conn,
240+
)
241+
242+
run_command(
243+
f"""
244+
CREATE TABLE test_kw_csv_cq_dst (id int, s test_kw_csv_cq);
245+
COPY test_kw_csv_cq_dst FROM '{url}' WITH (format 'csv', header on);
246+
""",
247+
pg_conn,
248+
)
249+
250+
result = run_query(
251+
"SELECT id, (s).normal FROM test_kw_csv_cq_dst",
252+
pg_conn,
253+
)
254+
assert result == [[1, 99]]
255+
256+
result = run_query(
257+
"""SELECT (s)."has'single", (s).U&"has\\005Cback" FROM test_kw_csv_cq_dst""",
258+
pg_conn,
259+
)
260+
assert result == [[11, 12]]
261+
finally:
262+
pg_conn.rollback()
263+
264+
265+
# ---------------------------------------------------------------------------
266+
# Parameterised round-trip tests — each keyword, each format
267+
# ---------------------------------------------------------------------------
268+
269+
270+
@pytest.mark.parametrize("keyword", DUCKDB_ONLY_RESERVED)
271+
def test_copy_roundtrip_parquet(pg_conn, s3, keyword):
272+
"""Each DuckDB-only reserved keyword survives a parquet round-trip."""
273+
url = f"s3://{TEST_BUCKET}/test_kw_roundtrip_parquet/{keyword}/data.parquet"
274+
275+
try:
276+
run_command(
277+
f"""
278+
CREATE TABLE test_kw_pq_src_{keyword} ({keyword} int);
279+
INSERT INTO test_kw_pq_src_{keyword} VALUES (42);
280+
COPY test_kw_pq_src_{keyword} TO '{url}' WITH (format 'parquet');
281+
282+
CREATE TABLE test_kw_pq_dst_{keyword} ({keyword} int);
283+
COPY test_kw_pq_dst_{keyword} FROM '{url}' WITH (format 'parquet');
284+
""",
285+
pg_conn,
286+
)
287+
288+
result = run_query(f"SELECT {keyword} FROM test_kw_pq_dst_{keyword}", pg_conn)
289+
assert result == [[42]]
290+
finally:
291+
pg_conn.rollback()
292+
293+
294+
@pytest.mark.parametrize("keyword", DUCKDB_ONLY_RESERVED)
295+
def test_copy_roundtrip_csv(pg_conn, s3, keyword):
296+
"""Each DuckDB-only reserved keyword survives a CSV round-trip."""
297+
url = f"s3://{TEST_BUCKET}/test_kw_roundtrip_csv/{keyword}/data.csv"
298+
299+
try:
300+
run_command(
301+
f"""
302+
CREATE TABLE test_kw_csv_src_{keyword} ({keyword} int);
303+
INSERT INTO test_kw_csv_src_{keyword} VALUES (42);
304+
COPY test_kw_csv_src_{keyword} TO '{url}' WITH (format 'csv', header on);
305+
306+
CREATE TABLE test_kw_csv_dst_{keyword} ({keyword} int);
307+
COPY test_kw_csv_dst_{keyword} FROM '{url}' WITH (format 'csv', header on);
308+
""",
309+
pg_conn,
310+
)
311+
312+
result = run_query(f"SELECT {keyword} FROM test_kw_csv_dst_{keyword}", pg_conn)
313+
assert result == [[42]]
314+
finally:
315+
pg_conn.rollback()
316+
317+
318+
@pytest.mark.parametrize("keyword", DUCKDB_ONLY_RESERVED)
319+
def test_copy_roundtrip_json(pg_conn, s3, keyword):
320+
"""Each DuckDB-only reserved keyword survives a JSON round-trip."""
321+
url = f"s3://{TEST_BUCKET}/test_kw_roundtrip_json/{keyword}/data.json"
322+
323+
try:
324+
run_command(
325+
f"""
326+
CREATE TABLE test_kw_json_src_{keyword} ({keyword} int);
327+
INSERT INTO test_kw_json_src_{keyword} VALUES (42);
328+
COPY test_kw_json_src_{keyword} TO '{url}' WITH (format 'json');
329+
330+
CREATE TABLE test_kw_json_dst_{keyword} ({keyword} int);
331+
COPY test_kw_json_dst_{keyword} FROM '{url}' WITH (format 'json');
332+
""",
333+
pg_conn,
334+
)
335+
336+
result = run_query(f"SELECT {keyword} FROM test_kw_json_dst_{keyword}", pg_conn)
337+
assert result == [[42]]
338+
finally:
339+
pg_conn.rollback()

pg_lake_engine/include/pg_lake/pgduck/keywords.h

Lines changed: 11 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -17,4 +17,14 @@
1717

1818
#pragma once
1919

20-
PGDLLEXPORT bool IsDuckDBReservedWord(char *candidateWord);
20+
/*
21+
* duckdb_quote_identifier — like quote_identifier() but also quotes
22+
* identifiers that are reserved (in any non-UNRESERVED category) in DuckDB
23+
* but not in PostgreSQL — e.g. LAMBDA, PIVOT, QUALIFY, SUMMARIZE, DESCRIBE,
24+
* SHOW, UNPIVOT (RESERVED_KEYWORD) as well as ASOF, ANTI, GLOB
25+
* (COL_NAME_KEYWORD / TYPE_FUNC_NAME_KEYWORD).
26+
*
27+
* Use this for all identifiers (column names, field names, relation names)
28+
* that will appear in SQL sent to pgduck_server.
29+
*/
30+
PGDLLEXPORT const char *duckdb_quote_identifier(const char *ident);

0 commit comments

Comments
 (0)