|
| 1 | +# PostgreSQL to DuckDB JSON Query Translator |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This translator converts PostgreSQL JSON queries to DuckDB-compatible syntax, specifically handling differences in JSON operators, type casts, timestamp functions, and date operations. |
| 6 | + |
| 7 | +## Files |
| 8 | + |
| 9 | +- **[pg_to_duckdb_translator.hpp](pg_to_duckdb_translator.hpp)** - Header file with class declaration |
| 10 | +- **[pg_to_duckdb_translator.cpp](pg_to_duckdb_translator.cpp)** - Implementation |
| 11 | +- **[pg_to_duckdb_translator_test.cpp](pg_to_duckdb_translator_test.cpp)** - Test cases |
| 12 | + |
| 13 | +## Translation Rules |
| 14 | + |
| 15 | +### 1. JSON Operators |
| 16 | +- PostgreSQL: `json -> 'key'` or `json ->> 'key'` |
| 17 | +- DuckDB: `json->>'$.key'` |
| 18 | +- Chained: `json -> 'a' -> 'b' ->> 'c'` → `json->>'$.a.b.c'` |
| 19 | + |
| 20 | +### 2. Type Casts |
| 21 | +- PostgreSQL: `(expr)::TYPE`, `identifier::TYPE`, or `'literal'::TYPE` |
| 22 | +- DuckDB: `CAST(expr AS TYPE)` |
| 23 | +- Supports compound types like `DOUBLE PRECISION` |
| 24 | +- Supports string literals: `'text'::varchar` → `CAST('text' AS varchar)` |
| 25 | + |
| 26 | +### 3. Timestamp Conversions |
| 27 | +- PostgreSQL: `TIMESTAMP WITH TIME ZONE 'epoch' + INTERVAL '1 microsecond' * (json->>'time_us')::BIGINT` |
| 28 | +- DuckDB: `TO_TIMESTAMP(CAST(json->>'$.time_us' AS BIGINT) / 1000000)` |
| 29 | + |
| 30 | +### 4. Date Extraction |
| 31 | +- PostgreSQL: `EXTRACT(HOUR FROM expr)` |
| 32 | +- DuckDB: `hour(expr)` |
| 33 | +- Supported fields: HOUR, DAY, MONTH, YEAR, MINUTE, SECOND, DOW, DOY, WEEK |
| 34 | + |
| 35 | +### 5. Date Differences |
| 36 | +- PostgreSQL: `EXTRACT(EPOCH FROM (MAX(ts) - MIN(ts))) * 1000` |
| 37 | +- DuckDB: `date_diff('milliseconds', MIN(ts), MAX(ts))` |
| 38 | +- Multipliers: 1=seconds, 1000=milliseconds, 1000000=microseconds |
| 39 | + |
| 40 | +### 6. IN Clauses |
| 41 | +- PostgreSQL: `expr IN ('a', 'b', 'c')` |
| 42 | +- DuckDB: `expr in ['a', 'b', 'c']` |
| 43 | + |
| 44 | +### 7. WHERE Clause Predicates |
| 45 | +- PostgreSQL: `WHERE a = 1 AND b = 2` |
| 46 | +- DuckDB: `WHERE (a = 1) AND (b = 2)` |
| 47 | +- Each predicate separated by AND/OR is wrapped in parentheses |
| 48 | + |
| 49 | +### 8. Date Formatting Functions |
| 50 | +- PostgreSQL: `to_date('2024-01-01', 'YYYY-MM-DD')` |
| 51 | +- DuckDB: `strptime('2024-01-01', 'YYYY-MM-DD')` |
| 52 | + |
| 53 | +### 9. Integer Division |
| 54 | +- PostgreSQL: `DIV(a, b)` |
| 55 | +- DuckDB: `(a // b)` |
| 56 | + |
| 57 | +### 10. Regular Expression Functions |
| 58 | +- PostgreSQL: `regexp_substr(string, pattern)` |
| 59 | +- DuckDB: `regexp_extract(string, pattern)` |
| 60 | + |
| 61 | +## Usage |
| 62 | + |
| 63 | +```cpp |
| 64 | +#include "pg_to_duckdb_translator.hpp" |
| 65 | + |
| 66 | +std::string pg_query = "SELECT json -> 'commit' ->> 'collection' FROM table"; |
| 67 | +std::string duckdb_query = pg::pg_to_duckdb_translator::translate(pg_query); |
| 68 | +``` |
| 69 | + |
| 70 | +## Integration |
| 71 | + |
| 72 | +### Option 1: Integrate into DuckDB Executor |
| 73 | + |
| 74 | +Add the translator to your existing query execution path: |
| 75 | + |
| 76 | +```cpp |
| 77 | +// In duckdb_executor.cpp |
| 78 | +#include "pg_to_duckdb_translator.hpp" |
| 79 | + |
| 80 | +std::unique_ptr<duckdb::MaterializedQueryResult> |
| 81 | +execute_query(duckdb::Connection& conn, const std::string& pg_query) { |
| 82 | + // Translate PostgreSQL syntax to DuckDB syntax |
| 83 | + std::string duckdb_query = pg::pg_to_duckdb_translator::translate(pg_query); |
| 84 | + |
| 85 | + // Optional: Log the translation for debugging |
| 86 | + // elog(DEBUG1, "Original query: %s", pg_query.c_str()); |
| 87 | + // elog(DEBUG1, "Translated query: %s", duckdb_query.c_str()); |
| 88 | + |
| 89 | + // Execute the translated query |
| 90 | + auto result = conn.Query(duckdb_query); |
| 91 | + |
| 92 | + if (result->HasError()) { |
| 93 | + elog(ERROR, "DuckDB query failed: %s", result->GetError().c_str()); |
| 94 | + } |
| 95 | + |
| 96 | + return result; |
| 97 | +} |
| 98 | +``` |
| 99 | +
|
| 100 | +### Option 2: Add as a PostgreSQL Hook |
| 101 | +
|
| 102 | +Create a query rewrite hook: |
| 103 | +
|
| 104 | +```cpp |
| 105 | +// In your extension initialization |
| 106 | +static ProcessUtility_hook_type prev_ProcessUtility = NULL; |
| 107 | +
|
| 108 | +static void deeplake_ProcessUtility( |
| 109 | + PlannedStmt *pstmt, |
| 110 | + const char *queryString, |
| 111 | + ProcessUtilityContext context, |
| 112 | + ParamListInfo params, |
| 113 | + QueryEnvironment *queryEnv, |
| 114 | + DestReceiver *dest, |
| 115 | + QueryCompletion *qc) |
| 116 | +{ |
| 117 | + // Check if this is a SELECT query targeting your tables |
| 118 | + if (is_deeplake_query(queryString)) { |
| 119 | + // Translate the query |
| 120 | + std::string translated = pg::pg_to_duckdb_translator::translate(queryString); |
| 121 | +
|
| 122 | + // Execute via DuckDB |
| 123 | + execute_via_duckdb(translated); |
| 124 | + return; |
| 125 | + } |
| 126 | +
|
| 127 | + // Fall through to standard processing |
| 128 | + if (prev_ProcessUtility) |
| 129 | + prev_ProcessUtility(pstmt, queryString, context, params, queryEnv, dest, qc); |
| 130 | + else |
| 131 | + standard_ProcessUtility(pstmt, queryString, context, params, queryEnv, dest, qc); |
| 132 | +} |
| 133 | +
|
| 134 | +void _PG_init(void) { |
| 135 | + // Install the hook |
| 136 | + prev_ProcessUtility = ProcessUtility_hook; |
| 137 | + ProcessUtility_hook = deeplake_ProcessUtility; |
| 138 | +} |
| 139 | +``` |
| 140 | + |
| 141 | +### Option 3: Automatic Translation Flag |
| 142 | + |
| 143 | +Add an environment variable or GUC to enable automatic translation: |
| 144 | + |
| 145 | +```cpp |
| 146 | +// In extension_init.cpp or pg_deeplake.cpp |
| 147 | +static bool enable_pg_to_duckdb_translation = true; |
| 148 | + |
| 149 | +void _PG_init(void) { |
| 150 | + // Define GUC variable |
| 151 | + DefineCustomBoolVariable( |
| 152 | + "deeplake.enable_query_translation", |
| 153 | + "Enable automatic PostgreSQL to DuckDB query translation", |
| 154 | + NULL, |
| 155 | + &enable_pg_to_duckdb_translation, |
| 156 | + true, |
| 157 | + PGC_USERSET, |
| 158 | + 0, |
| 159 | + NULL, NULL, NULL |
| 160 | + ); |
| 161 | +} |
| 162 | + |
| 163 | +// Then in your query execution: |
| 164 | +std::string prepare_query(const std::string& query) { |
| 165 | + if (enable_pg_to_duckdb_translation) { |
| 166 | + return pg::pg_to_duckdb_translator::translate(query); |
| 167 | + } |
| 168 | + return query; |
| 169 | +} |
| 170 | +``` |
| 171 | +
|
| 172 | +## Example Queries |
| 173 | +
|
| 174 | +### Date Formatting with to_date |
| 175 | +
|
| 176 | +**PostgreSQL:** |
| 177 | +```sql |
| 178 | +SELECT id, name |
| 179 | +FROM users |
| 180 | +WHERE signup_date >= to_date('2024-01-01', 'YYYY-MM-DD'); |
| 181 | +``` |
| 182 | + |
| 183 | +**DuckDB:** |
| 184 | +```sql |
| 185 | +SELECT id, name |
| 186 | +FROM users |
| 187 | +WHERE (signup_date >= strptime('2024-01-01', 'YYYY-MM-DD')); |
| 188 | +``` |
| 189 | + |
| 190 | +### Integer Division with DIV |
| 191 | + |
| 192 | +**PostgreSQL:** |
| 193 | +```sql |
| 194 | +SELECT id, DIV(total_amount, 100) AS dollars |
| 195 | +FROM transactions |
| 196 | +WHERE DIV(quantity, 10) > 5; |
| 197 | +``` |
| 198 | + |
| 199 | +**DuckDB:** |
| 200 | +```sql |
| 201 | +SELECT id, (total_amount // 100) AS dollars |
| 202 | +FROM transactions |
| 203 | +WHERE ((quantity // 10) > 5); |
| 204 | +``` |
| 205 | + |
| 206 | +### Regular Expression Extraction |
| 207 | + |
| 208 | +**PostgreSQL:** |
| 209 | +```sql |
| 210 | +SELECT id, regexp_substr(email, '[^@]+') AS username |
| 211 | +FROM users |
| 212 | +WHERE regexp_substr(domain, '[a-z]+') = 'example'; |
| 213 | +``` |
| 214 | + |
| 215 | +**DuckDB:** |
| 216 | +```sql |
| 217 | +SELECT id, regexp_extract(email, '[^@]+') AS username |
| 218 | +FROM users |
| 219 | +WHERE (regexp_extract(domain, '[a-z]+') = 'example'); |
| 220 | +``` |
| 221 | + |
| 222 | +### Simple WHERE Clauses |
| 223 | + |
| 224 | +**PostgreSQL:** |
| 225 | +```sql |
| 226 | +SELECT COUNT(*) FROM bluesky |
| 227 | +WHERE json ->> 'kind' = 'commit' |
| 228 | + AND json -> 'commit' ->> 'operation' = 'create'; |
| 229 | +``` |
| 230 | + |
| 231 | +**DuckDB:** |
| 232 | +```sql |
| 233 | +SELECT COUNT(*) FROM bluesky |
| 234 | +WHERE json->>'$.kind' = 'commit' |
| 235 | + AND json->>'$.commit.operation' = 'create'; |
| 236 | +``` |
| 237 | + |
| 238 | +### Type Casting in WHERE |
| 239 | + |
| 240 | +**PostgreSQL:** |
| 241 | +```sql |
| 242 | +SELECT id, json ->> 'repo' as repo |
| 243 | +FROM bluesky |
| 244 | +WHERE (json ->> 'seq')::BIGINT > 1000000; |
| 245 | +``` |
| 246 | + |
| 247 | +**DuckDB:** |
| 248 | +```sql |
| 249 | +SELECT id, json->>'$.repo' as repo |
| 250 | +FROM bluesky |
| 251 | +WHERE CAST(json->>'$.seq' AS BIGINT) > 1000000; |
| 252 | +``` |
| 253 | + |
| 254 | +### IN Clauses |
| 255 | + |
| 256 | +**PostgreSQL:** |
| 257 | +```sql |
| 258 | +SELECT COUNT(*) FROM bluesky |
| 259 | +WHERE json ->> 'kind' = 'commit' |
| 260 | + AND json -> 'commit' ->> 'collection' IN ('app.bsky.feed.post', 'app.bsky.feed.like'); |
| 261 | +``` |
| 262 | + |
| 263 | +**DuckDB:** |
| 264 | +```sql |
| 265 | +SELECT COUNT(*) FROM bluesky |
| 266 | +WHERE json->>'$.kind' = 'commit' |
| 267 | + AND json->>'$.commit.collection' in ['app.bsky.feed.post', 'app.bsky.feed.like']; |
| 268 | +``` |
| 269 | + |
| 270 | +### Complex Aggregation with Timestamps |
| 271 | + |
| 272 | +**PostgreSQL:** |
| 273 | +```sql |
| 274 | +SELECT json->>'did' AS user_id, |
| 275 | + MIN(TIMESTAMP WITH TIME ZONE 'epoch' + |
| 276 | + INTERVAL '1 microsecond' * (json->>'time_us')::BIGINT) AS first_post_ts |
| 277 | +FROM bluesky |
| 278 | +WHERE json->>'kind' = 'commit' |
| 279 | +GROUP BY user_id |
| 280 | +LIMIT 3; |
| 281 | +``` |
| 282 | + |
| 283 | +**DuckDB:** |
| 284 | +```sql |
| 285 | +SELECT json->>'$.did' AS user_id, |
| 286 | + MIN(TO_TIMESTAMP(CAST(json->>'$.time_us' AS BIGINT) / 1000000) ) AS first_post_ts |
| 287 | +FROM bluesky |
| 288 | +WHERE json->>'$.kind' = 'commit' |
| 289 | +GROUP BY user_id |
| 290 | +LIMIT 3; |
| 291 | +``` |
| 292 | + |
| 293 | +## Testing |
| 294 | + |
| 295 | +Run the standalone test: |
| 296 | +```bash |
| 297 | +cd cpp/deeplake_pg |
| 298 | +chmod +x test_translator.sh |
| 299 | +./test_translator.sh |
| 300 | +``` |
| 301 | + |
| 302 | +All 24 test cases should pass: |
| 303 | +- ✓ Test 1: Simple JSON access with GROUP BY |
| 304 | +- ✓ Test 2: Multiple JSON access with WHERE |
| 305 | +- ✓ Test 3: EXTRACT HOUR with IN clause |
| 306 | +- ✓ Test 4: MIN with TIMESTAMP epoch conversion |
| 307 | +- ✓ Test 5: Date difference with EXTRACT EPOCH |
| 308 | +- ✓ Test 6: Simple WHERE with COUNT |
| 309 | +- ✓ Test 7: WHERE with nested JSON and string comparison |
| 310 | +- ✓ Test 8: WHERE with type cast |
| 311 | +- ✓ Test 9: WHERE with IN clause and multiple JSON operators |
| 312 | +- ✓ Test 10: Subquery with WHERE clause (regression test) |
| 313 | +- ✓ Test 11: to_date to strptime conversion |
| 314 | +- ✓ Test 12: DIV function to // operator |
| 315 | +- ✓ Test 13: regexp_substr to regexp_extract |
| 316 | +- ✓ Test 14: Combined date operations |
| 317 | +- ✓ Test 15: Arithmetic with DIV and type casts |
| 318 | +- ✓ Test 16: Pattern matching and string operations |
| 319 | +- ✓ Test 17: Case insensitive TO_DATE |
| 320 | +- ✓ Test 18: Nested functions with type casts |
| 321 | +- ✓ Test 19: Complex WHERE with mixed operators |
| 322 | +- ✓ Test 20: Multiple regexp_substr in SELECT and WHERE |
| 323 | +- ✓ Test 21: DOUBLE PRECISION compound type |
| 324 | +- ✓ Test 22: VARCHAR and TEXT type casts (real test from test_array.py) |
| 325 | +- ✓ Test 23: String concatenation with TEXT cast (real test from test_string_types.py) |
| 326 | +- ✓ Test 24: JSON column to TEXT cast (real test from test_json.py) |
| 327 | + |
| 328 | +## Build Integration |
| 329 | + |
| 330 | +The translator is automatically included in the PostgreSQL extension build via CMake's glob pattern in [CMakeLists.pg.cmake](../../CMakeLists.pg.cmake): |
| 331 | + |
| 332 | +```cmake |
| 333 | +file(GLOB_RECURSE PG_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/${PG_MODULE}/*.cpp" ...) |
| 334 | +``` |
| 335 | + |
| 336 | +No additional build configuration is needed - just rebuild the extension: |
| 337 | +```bash |
| 338 | +python3 scripts/build_pg_ext.py dev |
| 339 | +``` |
| 340 | + |
| 341 | +## Debugging |
| 342 | + |
| 343 | +If queries fail after translation, enable logging: |
| 344 | + |
| 345 | +```cpp |
| 346 | +#include "pg_to_duckdb_translator.hpp" |
| 347 | + |
| 348 | +std::string query = "..."; |
| 349 | +std::string translated = pg::pg_to_duckdb_translator::translate(query); |
| 350 | + |
| 351 | +elog(INFO, "Original PostgreSQL query:"); |
| 352 | +elog(INFO, "%s", query.c_str()); |
| 353 | +elog(INFO, "Translated DuckDB query:"); |
| 354 | +elog(INFO, "%s", translated.c_str()); |
| 355 | +``` |
| 356 | +
|
| 357 | +## Performance Considerations |
| 358 | +
|
| 359 | +- **Translation overhead:** ~0.1-1ms per query (depends on query complexity) |
| 360 | +- **Caching:** For queries executed multiple times, consider caching translations |
| 361 | +- **Execution impact:** No impact on query execution time (translation happens once before execution) |
| 362 | +- **Implementation:** Uses regex-based pattern matching with manual parenthesis balancing for complex expressions |
| 363 | +- **Memory:** Minimal allocation - mostly string operations |
| 364 | +
|
| 365 | +## Rollback Plan |
| 366 | +
|
| 367 | +If you encounter issues: |
| 368 | +1. Set `deeplake.enable_query_translation = false` in postgresql.conf |
| 369 | +2. Or remove the translation call from your code temporarily |
| 370 | +3. Queries will be passed to DuckDB unchanged (may require manual DuckDB syntax) |
| 371 | +
|
| 372 | +## Limitations |
| 373 | +
|
| 374 | +- Does not handle all PostgreSQL syntax - focused on common JSON query patterns |
| 375 | +- Nested CAST expressions beyond 2 levels may need testing |
| 376 | +- Comments in SQL queries are not preserved |
| 377 | +- Multi-statement queries are translated as a whole (no statement separation) |
| 378 | +
|
| 379 | +## Future Enhancements |
| 380 | +
|
| 381 | +Possible improvements: |
| 382 | +1. Add support for more PostgreSQL-specific functions |
| 383 | +2. Handle JSON array indexing: `json->'array'->0` |
| 384 | +3. Support JSONB operators |
| 385 | +4. Add query validation/linting |
| 386 | +5. Performance profiling and optimization |
0 commit comments