Skip to content

Commit 1d26914

Browse files
Added PG->DuckDB query language translation layer
1 parent 5e830a9 commit 1d26914

File tree

6 files changed

+1617
-3
lines changed

6 files changed

+1617
-3
lines changed
Lines changed: 386 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,386 @@
1+
# PostgreSQL to DuckDB JSON Query Translator
2+
3+
## Overview
4+
5+
This translator converts PostgreSQL JSON queries to DuckDB-compatible syntax, specifically handling differences in JSON operators, type casts, timestamp functions, and date operations.
6+
7+
## Files
8+
9+
- **[pg_to_duckdb_translator.hpp](pg_to_duckdb_translator.hpp)** - Header file with class declaration
10+
- **[pg_to_duckdb_translator.cpp](pg_to_duckdb_translator.cpp)** - Implementation
11+
- **[pg_to_duckdb_translator_test.cpp](pg_to_duckdb_translator_test.cpp)** - Test cases
12+
13+
## Translation Rules
14+
15+
### 1. JSON Operators
16+
- PostgreSQL: `json -> 'key'` or `json ->> 'key'`
17+
- DuckDB: `json->>'$.key'`
18+
- Chained: `json -> 'a' -> 'b' ->> 'c'``json->>'$.a.b.c'`
19+
20+
### 2. Type Casts
21+
- PostgreSQL: `(expr)::TYPE`, `identifier::TYPE`, or `'literal'::TYPE`
22+
- DuckDB: `CAST(expr AS TYPE)`
23+
- Supports compound types like `DOUBLE PRECISION`
24+
- Supports string literals: `'text'::varchar``CAST('text' AS varchar)`
25+
26+
### 3. Timestamp Conversions
27+
- PostgreSQL: `TIMESTAMP WITH TIME ZONE 'epoch' + INTERVAL '1 microsecond' * (json->>'time_us')::BIGINT`
28+
- DuckDB: `TO_TIMESTAMP(CAST(json->>'$.time_us' AS BIGINT) / 1000000)`
29+
30+
### 4. Date Extraction
31+
- PostgreSQL: `EXTRACT(HOUR FROM expr)`
32+
- DuckDB: `hour(expr)`
33+
- Supported fields: HOUR, DAY, MONTH, YEAR, MINUTE, SECOND, DOW, DOY, WEEK
34+
35+
### 5. Date Differences
36+
- PostgreSQL: `EXTRACT(EPOCH FROM (MAX(ts) - MIN(ts))) * 1000`
37+
- DuckDB: `date_diff('milliseconds', MIN(ts), MAX(ts))`
38+
- Multipliers: 1=seconds, 1000=milliseconds, 1000000=microseconds
39+
40+
### 6. IN Clauses
41+
- PostgreSQL: `expr IN ('a', 'b', 'c')`
42+
- DuckDB: `expr in ['a', 'b', 'c']`
43+
44+
### 7. WHERE Clause Predicates
45+
- PostgreSQL: `WHERE a = 1 AND b = 2`
46+
- DuckDB: `WHERE (a = 1) AND (b = 2)`
47+
- Each predicate separated by AND/OR is wrapped in parentheses
48+
49+
### 8. Date Formatting Functions
50+
- PostgreSQL: `to_date('2024-01-01', 'YYYY-MM-DD')`
51+
- DuckDB: `strptime('2024-01-01', 'YYYY-MM-DD')`
52+
53+
### 9. Integer Division
54+
- PostgreSQL: `DIV(a, b)`
55+
- DuckDB: `(a // b)`
56+
57+
### 10. Regular Expression Functions
58+
- PostgreSQL: `regexp_substr(string, pattern)`
59+
- DuckDB: `regexp_extract(string, pattern)`
60+
61+
## Usage
62+
63+
```cpp
64+
#include "pg_to_duckdb_translator.hpp"
65+
66+
std::string pg_query = "SELECT json -> 'commit' ->> 'collection' FROM table";
67+
std::string duckdb_query = pg::pg_to_duckdb_translator::translate(pg_query);
68+
```
69+
70+
## Integration
71+
72+
### Option 1: Integrate into DuckDB Executor
73+
74+
Add the translator to your existing query execution path:
75+
76+
```cpp
77+
// In duckdb_executor.cpp
78+
#include "pg_to_duckdb_translator.hpp"
79+
80+
std::unique_ptr<duckdb::MaterializedQueryResult>
81+
execute_query(duckdb::Connection& conn, const std::string& pg_query) {
82+
// Translate PostgreSQL syntax to DuckDB syntax
83+
std::string duckdb_query = pg::pg_to_duckdb_translator::translate(pg_query);
84+
85+
// Optional: Log the translation for debugging
86+
// elog(DEBUG1, "Original query: %s", pg_query.c_str());
87+
// elog(DEBUG1, "Translated query: %s", duckdb_query.c_str());
88+
89+
// Execute the translated query
90+
auto result = conn.Query(duckdb_query);
91+
92+
if (result->HasError()) {
93+
elog(ERROR, "DuckDB query failed: %s", result->GetError().c_str());
94+
}
95+
96+
return result;
97+
}
98+
```
99+
100+
### Option 2: Add as a PostgreSQL Hook
101+
102+
Create a query rewrite hook:
103+
104+
```cpp
105+
// In your extension initialization
106+
static ProcessUtility_hook_type prev_ProcessUtility = NULL;
107+
108+
static void deeplake_ProcessUtility(
109+
PlannedStmt *pstmt,
110+
const char *queryString,
111+
ProcessUtilityContext context,
112+
ParamListInfo params,
113+
QueryEnvironment *queryEnv,
114+
DestReceiver *dest,
115+
QueryCompletion *qc)
116+
{
117+
// Check if this is a SELECT query targeting your tables
118+
if (is_deeplake_query(queryString)) {
119+
// Translate the query
120+
std::string translated = pg::pg_to_duckdb_translator::translate(queryString);
121+
122+
// Execute via DuckDB
123+
execute_via_duckdb(translated);
124+
return;
125+
}
126+
127+
// Fall through to standard processing
128+
if (prev_ProcessUtility)
129+
prev_ProcessUtility(pstmt, queryString, context, params, queryEnv, dest, qc);
130+
else
131+
standard_ProcessUtility(pstmt, queryString, context, params, queryEnv, dest, qc);
132+
}
133+
134+
void _PG_init(void) {
135+
// Install the hook
136+
prev_ProcessUtility = ProcessUtility_hook;
137+
ProcessUtility_hook = deeplake_ProcessUtility;
138+
}
139+
```
140+
141+
### Option 3: Automatic Translation Flag
142+
143+
Add an environment variable or GUC to enable automatic translation:
144+
145+
```cpp
146+
// In extension_init.cpp or pg_deeplake.cpp
147+
static bool enable_pg_to_duckdb_translation = true;
148+
149+
void _PG_init(void) {
150+
// Define GUC variable
151+
DefineCustomBoolVariable(
152+
"deeplake.enable_query_translation",
153+
"Enable automatic PostgreSQL to DuckDB query translation",
154+
NULL,
155+
&enable_pg_to_duckdb_translation,
156+
true,
157+
PGC_USERSET,
158+
0,
159+
NULL, NULL, NULL
160+
);
161+
}
162+
163+
// Then in your query execution:
164+
std::string prepare_query(const std::string& query) {
165+
if (enable_pg_to_duckdb_translation) {
166+
return pg::pg_to_duckdb_translator::translate(query);
167+
}
168+
return query;
169+
}
170+
```
171+
172+
## Example Queries
173+
174+
### Date Formatting with to_date
175+
176+
**PostgreSQL:**
177+
```sql
178+
SELECT id, name
179+
FROM users
180+
WHERE signup_date >= to_date('2024-01-01', 'YYYY-MM-DD');
181+
```
182+
183+
**DuckDB:**
184+
```sql
185+
SELECT id, name
186+
FROM users
187+
WHERE (signup_date >= strptime('2024-01-01', 'YYYY-MM-DD'));
188+
```
189+
190+
### Integer Division with DIV
191+
192+
**PostgreSQL:**
193+
```sql
194+
SELECT id, DIV(total_amount, 100) AS dollars
195+
FROM transactions
196+
WHERE DIV(quantity, 10) > 5;
197+
```
198+
199+
**DuckDB:**
200+
```sql
201+
SELECT id, (total_amount // 100) AS dollars
202+
FROM transactions
203+
WHERE ((quantity // 10) > 5);
204+
```
205+
206+
### Regular Expression Extraction
207+
208+
**PostgreSQL:**
209+
```sql
210+
SELECT id, regexp_substr(email, '[^@]+') AS username
211+
FROM users
212+
WHERE regexp_substr(domain, '[a-z]+') = 'example';
213+
```
214+
215+
**DuckDB:**
216+
```sql
217+
SELECT id, regexp_extract(email, '[^@]+') AS username
218+
FROM users
219+
WHERE (regexp_extract(domain, '[a-z]+') = 'example');
220+
```
221+
222+
### Simple WHERE Clauses
223+
224+
**PostgreSQL:**
225+
```sql
226+
SELECT COUNT(*) FROM bluesky
227+
WHERE json ->> 'kind' = 'commit'
228+
AND json -> 'commit' ->> 'operation' = 'create';
229+
```
230+
231+
**DuckDB:**
232+
```sql
233+
SELECT COUNT(*) FROM bluesky
234+
WHERE json->>'$.kind' = 'commit'
235+
AND json->>'$.commit.operation' = 'create';
236+
```
237+
238+
### Type Casting in WHERE
239+
240+
**PostgreSQL:**
241+
```sql
242+
SELECT id, json ->> 'repo' as repo
243+
FROM bluesky
244+
WHERE (json ->> 'seq')::BIGINT > 1000000;
245+
```
246+
247+
**DuckDB:**
248+
```sql
249+
SELECT id, json->>'$.repo' as repo
250+
FROM bluesky
251+
WHERE CAST(json->>'$.seq' AS BIGINT) > 1000000;
252+
```
253+
254+
### IN Clauses
255+
256+
**PostgreSQL:**
257+
```sql
258+
SELECT COUNT(*) FROM bluesky
259+
WHERE json ->> 'kind' = 'commit'
260+
AND json -> 'commit' ->> 'collection' IN ('app.bsky.feed.post', 'app.bsky.feed.like');
261+
```
262+
263+
**DuckDB:**
264+
```sql
265+
SELECT COUNT(*) FROM bluesky
266+
WHERE json->>'$.kind' = 'commit'
267+
AND json->>'$.commit.collection' in ['app.bsky.feed.post', 'app.bsky.feed.like'];
268+
```
269+
270+
### Complex Aggregation with Timestamps
271+
272+
**PostgreSQL:**
273+
```sql
274+
SELECT json->>'did' AS user_id,
275+
MIN(TIMESTAMP WITH TIME ZONE 'epoch' +
276+
INTERVAL '1 microsecond' * (json->>'time_us')::BIGINT) AS first_post_ts
277+
FROM bluesky
278+
WHERE json->>'kind' = 'commit'
279+
GROUP BY user_id
280+
LIMIT 3;
281+
```
282+
283+
**DuckDB:**
284+
```sql
285+
SELECT json->>'$.did' AS user_id,
286+
MIN(TO_TIMESTAMP(CAST(json->>'$.time_us' AS BIGINT) / 1000000) ) AS first_post_ts
287+
FROM bluesky
288+
WHERE json->>'$.kind' = 'commit'
289+
GROUP BY user_id
290+
LIMIT 3;
291+
```
292+
293+
## Testing
294+
295+
Run the standalone test:
296+
```bash
297+
cd cpp/deeplake_pg
298+
chmod +x test_translator.sh
299+
./test_translator.sh
300+
```
301+
302+
All 24 test cases should pass:
303+
- ✓ Test 1: Simple JSON access with GROUP BY
304+
- ✓ Test 2: Multiple JSON access with WHERE
305+
- ✓ Test 3: EXTRACT HOUR with IN clause
306+
- ✓ Test 4: MIN with TIMESTAMP epoch conversion
307+
- ✓ Test 5: Date difference with EXTRACT EPOCH
308+
- ✓ Test 6: Simple WHERE with COUNT
309+
- ✓ Test 7: WHERE with nested JSON and string comparison
310+
- ✓ Test 8: WHERE with type cast
311+
- ✓ Test 9: WHERE with IN clause and multiple JSON operators
312+
- ✓ Test 10: Subquery with WHERE clause (regression test)
313+
- ✓ Test 11: to_date to strptime conversion
314+
- ✓ Test 12: DIV function to // operator
315+
- ✓ Test 13: regexp_substr to regexp_extract
316+
- ✓ Test 14: Combined date operations
317+
- ✓ Test 15: Arithmetic with DIV and type casts
318+
- ✓ Test 16: Pattern matching and string operations
319+
- ✓ Test 17: Case insensitive TO_DATE
320+
- ✓ Test 18: Nested functions with type casts
321+
- ✓ Test 19: Complex WHERE with mixed operators
322+
- ✓ Test 20: Multiple regexp_substr in SELECT and WHERE
323+
- ✓ Test 21: DOUBLE PRECISION compound type
324+
- ✓ Test 22: VARCHAR and TEXT type casts (real test from test_array.py)
325+
- ✓ Test 23: String concatenation with TEXT cast (real test from test_string_types.py)
326+
- ✓ Test 24: JSON column to TEXT cast (real test from test_json.py)
327+
328+
## Build Integration
329+
330+
The translator is automatically included in the PostgreSQL extension build via CMake's glob pattern in [CMakeLists.pg.cmake](../../CMakeLists.pg.cmake):
331+
332+
```cmake
333+
file(GLOB_RECURSE PG_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/${PG_MODULE}/*.cpp" ...)
334+
```
335+
336+
No additional build configuration is needed - just rebuild the extension:
337+
```bash
338+
python3 scripts/build_pg_ext.py dev
339+
```
340+
341+
## Debugging
342+
343+
If queries fail after translation, enable logging:
344+
345+
```cpp
346+
#include "pg_to_duckdb_translator.hpp"
347+
348+
std::string query = "...";
349+
std::string translated = pg::pg_to_duckdb_translator::translate(query);
350+
351+
elog(INFO, "Original PostgreSQL query:");
352+
elog(INFO, "%s", query.c_str());
353+
elog(INFO, "Translated DuckDB query:");
354+
elog(INFO, "%s", translated.c_str());
355+
```
356+
357+
## Performance Considerations
358+
359+
- **Translation overhead:** ~0.1-1ms per query (depends on query complexity)
360+
- **Caching:** For queries executed multiple times, consider caching translations
361+
- **Execution impact:** No impact on query execution time (translation happens once before execution)
362+
- **Implementation:** Uses regex-based pattern matching with manual parenthesis balancing for complex expressions
363+
- **Memory:** Minimal allocation - mostly string operations
364+
365+
## Rollback Plan
366+
367+
If you encounter issues:
368+
1. Set `deeplake.enable_query_translation = false` in postgresql.conf
369+
2. Or remove the translation call from your code temporarily
370+
3. Queries will be passed to DuckDB unchanged (may require manual DuckDB syntax)
371+
372+
## Limitations
373+
374+
- Does not handle all PostgreSQL syntax - focused on common JSON query patterns
375+
- Nested CAST expressions beyond 2 levels may need testing
376+
- Comments in SQL queries are not preserved
377+
- Multi-statement queries are translated as a whole (no statement separation)
378+
379+
## Future Enhancements
380+
381+
Possible improvements:
382+
1. Add support for more PostgreSQL-specific functions
383+
2. Handle JSON array indexing: `json->'array'->0`
384+
3. Support JSONB operators
385+
4. Add query validation/linting
386+
5. Performance profiling and optimization

0 commit comments

Comments
 (0)