|
| 1 | +# PostgreSQL to DuckDB JSON Query Translator |
| 2 | + |
| 3 | +## Overview |
| 4 | + |
| 5 | +This translator converts PostgreSQL JSON queries to DuckDB-compatible syntax, specifically handling differences in JSON operators, type casts, timestamp functions, and date operations. |
| 6 | + |
| 7 | +## Files |
| 8 | + |
| 9 | +- **[pg_to_duckdb_translator.hpp](pg_to_duckdb_translator.hpp)** - Header file with class declaration |
| 10 | +- **[pg_to_duckdb_translator.cpp](pg_to_duckdb_translator.cpp)** - Implementation |
| 11 | +- **[pg_to_duckdb_translator_test.cpp](pg_to_duckdb_translator_test.cpp)** - Test cases |
| 12 | + |
| 13 | +## Translation Rules |
| 14 | + |
| 15 | +### 1. JSON Operators |
| 16 | +- PostgreSQL: `json -> 'key'` or `json ->> 'key'` |
| 17 | +- DuckDB: `json->>'$.key'` |
| 18 | +- Chained: `json -> 'a' -> 'b' ->> 'c'` → `json->>'$.a.b.c'` |
| 19 | + |
| 20 | +### 2. Type Casts |
| 21 | +- PostgreSQL: `(expr)::TYPE` or `identifier::TYPE` |
| 22 | +- DuckDB: `CAST(expr AS TYPE)` |
| 23 | + |
| 24 | +### 3. Timestamp Conversions |
| 25 | +- PostgreSQL: `TIMESTAMP WITH TIME ZONE 'epoch' + INTERVAL '1 microsecond' * (json->>'time_us')::BIGINT` |
| 26 | +- DuckDB: `TO_TIMESTAMP(CAST(json->>'$.time_us' AS BIGINT) / 1000000)` |
| 27 | + |
| 28 | +### 4. Date Extraction |
| 29 | +- PostgreSQL: `EXTRACT(HOUR FROM expr)` |
| 30 | +- DuckDB: `hour(expr)` |
| 31 | +- Supported fields: HOUR, DAY, MONTH, YEAR, MINUTE, SECOND, DOW, DOY, WEEK |
| 32 | + |
| 33 | +### 5. Date Differences |
| 34 | +- PostgreSQL: `EXTRACT(EPOCH FROM (MAX(ts) - MIN(ts))) * 1000` |
| 35 | +- DuckDB: `date_diff('milliseconds', MIN(ts), MAX(ts))` |
| 36 | +- Multipliers: 1=seconds, 1000=milliseconds, 1000000=microseconds |
| 37 | + |
| 38 | +### 6. IN Clauses |
| 39 | +- PostgreSQL: `expr IN ('a', 'b', 'c')` |
| 40 | +- DuckDB: `expr in ['a', 'b', 'c']` |
| 41 | + |
| 42 | +### 7. WHERE Clause Predicates |
| 43 | +- PostgreSQL: `WHERE a = 1 AND b = 2` |
| 44 | +- DuckDB: `WHERE (a = 1) AND (b = 2)` |
| 45 | +- Each predicate separated by AND/OR is wrapped in parentheses |
| 46 | + |
| 47 | +## Usage |
| 48 | + |
| 49 | +```cpp |
| 50 | +#include "pg_to_duckdb_translator.hpp" |
| 51 | + |
| 52 | +std::string pg_query = "SELECT json -> 'commit' ->> 'collection' FROM table"; |
| 53 | +std::string duckdb_query = pg::pg_to_duckdb_translator::translate(pg_query); |
| 54 | +``` |
| 55 | + |
| 56 | +## Integration |
| 57 | + |
| 58 | +### Option 1: Integrate into DuckDB Executor |
| 59 | + |
| 60 | +Add the translator to your existing query execution path: |
| 61 | + |
| 62 | +```cpp |
| 63 | +// In duckdb_executor.cpp |
| 64 | +#include "pg_to_duckdb_translator.hpp" |
| 65 | + |
| 66 | +std::unique_ptr<duckdb::MaterializedQueryResult> |
| 67 | +execute_query(duckdb::Connection& conn, const std::string& pg_query) { |
| 68 | + // Translate PostgreSQL syntax to DuckDB syntax |
| 69 | + std::string duckdb_query = pg::pg_to_duckdb_translator::translate(pg_query); |
| 70 | + |
| 71 | + // Optional: Log the translation for debugging |
| 72 | + // elog(DEBUG1, "Original query: %s", pg_query.c_str()); |
| 73 | + // elog(DEBUG1, "Translated query: %s", duckdb_query.c_str()); |
| 74 | + |
| 75 | + // Execute the translated query |
| 76 | + auto result = conn.Query(duckdb_query); |
| 77 | + |
| 78 | + if (result->HasError()) { |
| 79 | + elog(ERROR, "DuckDB query failed: %s", result->GetError().c_str()); |
| 80 | + } |
| 81 | + |
| 82 | + return result; |
| 83 | +} |
| 84 | +``` |
| 85 | +
|
| 86 | +### Option 2: Add as a PostgreSQL Hook |
| 87 | +
|
| 88 | +Create a query rewrite hook: |
| 89 | +
|
| 90 | +```cpp |
| 91 | +// In your extension initialization |
| 92 | +static ProcessUtility_hook_type prev_ProcessUtility = NULL; |
| 93 | +
|
| 94 | +static void deeplake_ProcessUtility( |
| 95 | + PlannedStmt *pstmt, |
| 96 | + const char *queryString, |
| 97 | + ProcessUtilityContext context, |
| 98 | + ParamListInfo params, |
| 99 | + QueryEnvironment *queryEnv, |
| 100 | + DestReceiver *dest, |
| 101 | + QueryCompletion *qc) |
| 102 | +{ |
| 103 | + // Check if this is a SELECT query targeting your tables |
| 104 | + if (is_deeplake_query(queryString)) { |
| 105 | + // Translate the query |
| 106 | + std::string translated = pg::pg_to_duckdb_translator::translate(queryString); |
| 107 | +
|
| 108 | + // Execute via DuckDB |
| 109 | + execute_via_duckdb(translated); |
| 110 | + return; |
| 111 | + } |
| 112 | +
|
| 113 | + // Fall through to standard processing |
| 114 | + if (prev_ProcessUtility) |
| 115 | + prev_ProcessUtility(pstmt, queryString, context, params, queryEnv, dest, qc); |
| 116 | + else |
| 117 | + standard_ProcessUtility(pstmt, queryString, context, params, queryEnv, dest, qc); |
| 118 | +} |
| 119 | +
|
| 120 | +void _PG_init(void) { |
| 121 | + // Install the hook |
| 122 | + prev_ProcessUtility = ProcessUtility_hook; |
| 123 | + ProcessUtility_hook = deeplake_ProcessUtility; |
| 124 | +} |
| 125 | +``` |
| 126 | + |
| 127 | +### Option 3: Automatic Translation Flag |
| 128 | + |
| 129 | +Add an environment variable or GUC to enable automatic translation: |
| 130 | + |
| 131 | +```cpp |
| 132 | +// In extension_init.cpp or pg_deeplake.cpp |
| 133 | +static bool enable_pg_to_duckdb_translation = true; |
| 134 | + |
| 135 | +void _PG_init(void) { |
| 136 | + // Define GUC variable |
| 137 | + DefineCustomBoolVariable( |
| 138 | + "deeplake.enable_query_translation", |
| 139 | + "Enable automatic PostgreSQL to DuckDB query translation", |
| 140 | + NULL, |
| 141 | + &enable_pg_to_duckdb_translation, |
| 142 | + true, |
| 143 | + PGC_USERSET, |
| 144 | + 0, |
| 145 | + NULL, NULL, NULL |
| 146 | + ); |
| 147 | +} |
| 148 | + |
| 149 | +// Then in your query execution: |
| 150 | +std::string prepare_query(const std::string& query) { |
| 151 | + if (enable_pg_to_duckdb_translation) { |
| 152 | + return pg::pg_to_duckdb_translator::translate(query); |
| 153 | + } |
| 154 | + return query; |
| 155 | +} |
| 156 | +``` |
| 157 | +
|
| 158 | +## Example Queries |
| 159 | +
|
| 160 | +### Simple WHERE Clauses |
| 161 | +
|
| 162 | +**PostgreSQL:** |
| 163 | +```sql |
| 164 | +SELECT COUNT(*) FROM bluesky |
| 165 | +WHERE json ->> 'kind' = 'commit' |
| 166 | + AND json -> 'commit' ->> 'operation' = 'create'; |
| 167 | +``` |
| 168 | + |
| 169 | +**DuckDB:** |
| 170 | +```sql |
| 171 | +SELECT COUNT(*) FROM bluesky |
| 172 | +WHERE json->>'$.kind' = 'commit' |
| 173 | + AND json->>'$.commit.operation' = 'create'; |
| 174 | +``` |
| 175 | + |
| 176 | +### Type Casting in WHERE |
| 177 | + |
| 178 | +**PostgreSQL:** |
| 179 | +```sql |
| 180 | +SELECT id, json ->> 'repo' as repo |
| 181 | +FROM bluesky |
| 182 | +WHERE (json ->> 'seq')::BIGINT > 1000000; |
| 183 | +``` |
| 184 | + |
| 185 | +**DuckDB:** |
| 186 | +```sql |
| 187 | +SELECT id, json->>'$.repo' as repo |
| 188 | +FROM bluesky |
| 189 | +WHERE CAST(json->>'$.seq' AS BIGINT) > 1000000; |
| 190 | +``` |
| 191 | + |
| 192 | +### IN Clauses |
| 193 | + |
| 194 | +**PostgreSQL:** |
| 195 | +```sql |
| 196 | +SELECT COUNT(*) FROM bluesky |
| 197 | +WHERE json ->> 'kind' = 'commit' |
| 198 | + AND json -> 'commit' ->> 'collection' IN ('app.bsky.feed.post', 'app.bsky.feed.like'); |
| 199 | +``` |
| 200 | + |
| 201 | +**DuckDB:** |
| 202 | +```sql |
| 203 | +SELECT COUNT(*) FROM bluesky |
| 204 | +WHERE json->>'$.kind' = 'commit' |
| 205 | + AND json->>'$.commit.collection' in ['app.bsky.feed.post', 'app.bsky.feed.like']; |
| 206 | +``` |
| 207 | + |
| 208 | +### Complex Aggregation with Timestamps |
| 209 | + |
| 210 | +**PostgreSQL:** |
| 211 | +```sql |
| 212 | +SELECT json->>'did' AS user_id, |
| 213 | + MIN(TIMESTAMP WITH TIME ZONE 'epoch' + |
| 214 | + INTERVAL '1 microsecond' * (json->>'time_us')::BIGINT) AS first_post_ts |
| 215 | +FROM bluesky |
| 216 | +WHERE json->>'kind' = 'commit' |
| 217 | +GROUP BY user_id |
| 218 | +LIMIT 3; |
| 219 | +``` |
| 220 | + |
| 221 | +**DuckDB:** |
| 222 | +```sql |
| 223 | +SELECT json->>'$.did' AS user_id, |
| 224 | + MIN(TO_TIMESTAMP(CAST(json->>'$.time_us' AS BIGINT) / 1000000) ) AS first_post_ts |
| 225 | +FROM bluesky |
| 226 | +WHERE json->>'$.kind' = 'commit' |
| 227 | +GROUP BY user_id |
| 228 | +LIMIT 3; |
| 229 | +``` |
| 230 | + |
| 231 | +## Testing |
| 232 | + |
| 233 | +Run the standalone test: |
| 234 | +```bash |
| 235 | +cd cpp/deeplake_pg |
| 236 | +chmod +x test_translator.sh |
| 237 | +./test_translator.sh |
| 238 | +``` |
| 239 | + |
| 240 | +All 10 test cases should pass: |
| 241 | +- ✓ Test 1: Simple JSON access with GROUP BY |
| 242 | +- ✓ Test 2: Multiple JSON access with WHERE |
| 243 | +- ✓ Test 3: EXTRACT HOUR with IN clause |
| 244 | +- ✓ Test 4: MIN with TIMESTAMP epoch conversion |
| 245 | +- ✓ Test 5: Date difference with EXTRACT EPOCH |
| 246 | +- ✓ Test 6: Simple WHERE with COUNT |
| 247 | +- ✓ Test 7: WHERE with nested JSON and string comparison |
| 248 | +- ✓ Test 8: WHERE with type cast |
| 249 | +- ✓ Test 9: WHERE with IN clause and multiple JSON operators |
| 250 | +- ✓ Test 10: Subquery with WHERE clause (regression test) |
| 251 | + |
| 252 | +## Build Integration |
| 253 | + |
| 254 | +The translator is automatically included in the PostgreSQL extension build via CMake's glob pattern in [CMakeLists.pg.cmake](../../CMakeLists.pg.cmake): |
| 255 | + |
| 256 | +```cmake |
| 257 | +file(GLOB_RECURSE PG_SOURCES "${CMAKE_CURRENT_SOURCE_DIR}/${PG_MODULE}/*.cpp" ...) |
| 258 | +``` |
| 259 | + |
| 260 | +No additional build configuration is needed - just rebuild the extension: |
| 261 | +```bash |
| 262 | +python3 scripts/build_pg_ext.py dev |
| 263 | +``` |
| 264 | + |
| 265 | +## Debugging |
| 266 | + |
| 267 | +If queries fail after translation, enable logging: |
| 268 | + |
| 269 | +```cpp |
| 270 | +#include "pg_to_duckdb_translator.hpp" |
| 271 | + |
| 272 | +std::string query = "..."; |
| 273 | +std::string translated = pg::pg_to_duckdb_translator::translate(query); |
| 274 | + |
| 275 | +elog(INFO, "Original PostgreSQL query:"); |
| 276 | +elog(INFO, "%s", query.c_str()); |
| 277 | +elog(INFO, "Translated DuckDB query:"); |
| 278 | +elog(INFO, "%s", translated.c_str()); |
| 279 | +``` |
| 280 | +
|
| 281 | +## Performance Considerations |
| 282 | +
|
| 283 | +- **Translation overhead:** ~0.1-1ms per query (depends on query complexity) |
| 284 | +- **Caching:** For queries executed multiple times, consider caching translations |
| 285 | +- **Execution impact:** No impact on query execution time (translation happens once before execution) |
| 286 | +- **Implementation:** Uses regex-based pattern matching with manual parenthesis balancing for complex expressions |
| 287 | +- **Memory:** Minimal allocation - mostly string operations |
| 288 | +
|
| 289 | +## Rollback Plan |
| 290 | +
|
| 291 | +If you encounter issues: |
| 292 | +1. Set `deeplake.enable_query_translation = false` in postgresql.conf |
| 293 | +2. Or remove the translation call from your code temporarily |
| 294 | +3. Queries will be passed to DuckDB unchanged (may require manual DuckDB syntax) |
| 295 | +
|
| 296 | +## Limitations |
| 297 | +
|
| 298 | +- Does not handle all PostgreSQL syntax - focused on common JSON query patterns |
| 299 | +- Nested CAST expressions beyond 2 levels may need testing |
| 300 | +- Comments in SQL queries are not preserved |
| 301 | +- Multi-statement queries are translated as a whole (no statement separation) |
| 302 | +
|
| 303 | +## Future Enhancements |
| 304 | +
|
| 305 | +Possible improvements: |
| 306 | +1. Add support for more PostgreSQL-specific functions |
| 307 | +2. Handle JSON array indexing: `json->'array'->0` |
| 308 | +3. Support JSONB operators |
| 309 | +4. Add query validation/linting |
| 310 | +5. Performance profiling and optimization |
0 commit comments