Commit bbf3b72

update nativespecc json
1 parent b69ef66 commit bbf3b72

1 file changed

Lines changed: 195 additions & 18 deletions

nativespec.md

Lines changed: 195 additions & 18 deletions
@@ -2178,7 +2178,6 @@ When `Dynamic` has more distinct types than `max_types`:
21782178
- Additional types are stored in binary form in a "SharedVariant" type (always the last variant)
21792179
- Each value in SharedVariant is: `encodeDataType(T) + serializeBinary(value, T)`
21802180
- On deserialization, the type is decoded from the binary prefix and the value is deserialized
2181-
21822181
### 6.9 JSON
21832182

21842183
The `JSON` type stores semi-structured JSON data with automatic schema inference. It uses a path-oriented storage model in which each JSON path is stored as a separate column, similar to how `Nested` columns work.
@@ -2225,7 +2224,9 @@ Similar to `Dynamic`, `JSON` uses multiple streams:
22252224
**V3 (version = 4):**
22262225
- Like V2 with binary type encoding and optional statistics flag
22272226

2228-
> **Note:** Version numbers are not sequential. V1=0, V2=2, V3=4. There is no version 3.
2227+
> **Note:** JSON Object version numbers are: V1=0, STRING=1, V2=2, FLATTENED=3, V3=4.
2228+
2229+
> **Important:** `Dynamic` columns (used for dynamic paths) have **different** version numbers: V1=1, V2=2, FLATTENED=3, V3=4. Don't confuse JSON V1 (=0) with Dynamic V1 (=1).
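
The two numbering schemes can be kept apart by giving each its own lookup table. The following is a minimal sketch: the names `JsonObjectVersion`, `DynamicVersion`, and `versionName` are hypothetical, and only the numeric values are taken from the notes above.

```typescript
// Hypothetical names; the numeric values come from the two notes above.
const JsonObjectVersion = { V1: 0, STRING: 1, V2: 2, FLATTENED: 3, V3: 4 } as const;
const DynamicVersion = { V1: 1, V2: 2, FLATTENED: 3, V3: 4 } as const;

// Return the symbolic name for a decoded version number,
// or undefined when the number is not defined for that table.
function versionName(table: Record<string, number>, value: number): string | undefined {
  return Object.keys(table).find((name) => table[name] === value);
}
```

With these tables, `versionName(JsonObjectVersion, 0)` yields `"V1"`, while `versionName(DynamicVersion, 0)` yields `undefined`, which makes an accidental mix-up of the two schemes visible early.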
22292230
22302231
**STRING (version = 1):**
22312232
- Special mode for Native format only
@@ -2388,15 +2389,15 @@ const block = new Uint8Array([
23882389
// - typed path "a" is NOT listed here
23892390
// - only runtime-discovered paths appear
23902391

2391-
// Statistics (PREFIX mode):
2392-
0x01, // Path "b" non-null count = 1 (VarUInt)
2393-
0x00, // Shared data statistics: 0 entries (VarUInt)
2392+
// NOTE: In Native format, statistics mode defaults to NONE, so no
2393+
// statistics are written here. The next bytes are the Dynamic prefix.
23942394

23952395
// ═══════════════════════════════════════════════════════════════════
23962396
// DYNAMIC STRUCTURE STREAM (prefix for dynamic path "b")
23972397
// ═══════════════════════════════════════════════════════════════════
2398-
0x00, 0x00, 0x00, 0x00, // Dynamic version = 0 (V1)
2398+
0x01, 0x00, 0x00, 0x00, // Dynamic version = 1 (V1)
23992399
0x00, 0x00, 0x00, 0x00, // (UInt64 little-endian)
2400+
// NOTE: Dynamic V1=1, V2=2, FLATTENED=3, V3=4
24002401

24012402
// V1 Dynamic structure:
24022403
0x01, // max_dynamic_types = 1 (VarUInt)
@@ -2406,10 +2407,8 @@ const block = new Uint8Array([
24062407
0x06, 0x53, 0x74, 0x72, 0x69, // Type 0: "String" (len=6)
24072408
0x6e, 0x67,
24082409

2409-
// Dynamic statistics (for [String, SharedVariant]):
2410-
0x00, // String variant row count = 0 (placeholder)
2411-
0x00, // SharedVariant row count = 0 (placeholder)
2412-
0x00, // Shared variant statistics: 0 entries
2410+
// NOTE: Dynamic statistics also use NONE mode in Native format,
2411+
// so no statistics written here either.
24132412

24142413
// ═══════════════════════════════════════════════════════════════════
24152414
// VARIANT STRUCTURE STREAM (prefix for Dynamic's Variant)
@@ -2428,7 +2427,8 @@ const block = new Uint8Array([
24282427
// --- DYNAMIC PATH "b" (as Dynamic -> Variant) ---
24292428
// Variant discriminator column (1 row):
24302429
0x01, // Row 0: discriminator = 1 (String variant)
2431-
// 0 = NULL, 1 = String, 2 = SharedVariant
2430+
// Variants sorted alphabetically: SharedVariant=0, String=1
2431+
// (255 = NULL)
24322432

24332433
// String variant data:
24342434
0x02, 0x68, 0x69, // value = "hi" (len=2, then "hi")
@@ -2515,23 +2515,20 @@ const block = new Uint8Array([
25152515
0x01, // num_dynamic_paths = 1 (VarUInt)
25162516
0x01, 0x62, // Path 0: "b" (len=1)
25172517

2518-
0x01, // Statistics: path "b" non-null count
2519-
0x00, // Shared data statistics: 0 entries
2518+
// NOTE: No statistics in Native format (NONE mode is default)
25202519

25212520
// ═══════════════════════════════════════════════════════════════════
25222521
// DYNAMIC STRUCTURE STREAM (for path "b")
25232522
// ═══════════════════════════════════════════════════════════════════
2524-
0x00, 0x00, 0x00, 0x00, // Dynamic version = 0 (V1)
2525-
0x00, 0x00, 0x00, 0x00,
2523+
0x01, 0x00, 0x00, 0x00, // Dynamic version = 1 (V1)
2524+
0x00, 0x00, 0x00, 0x00, // (Dynamic V1=1, V2=2, FLATTENED=3, V3=4)
25262525

25272526
0x01, // max_dynamic_types = 1 (VarUInt)
25282527
0x01, // num_dynamic_types = 1 (VarUInt)
25292528
0x06, 0x53, 0x74, 0x72, 0x69, // Type 0: "String" (len=6)
25302529
0x6e, 0x67,
25312530

2532-
0x00, // Dynamic statistics (variant counts)
2533-
0x00,
2534-
0x00,
2531+
// NOTE: No Dynamic statistics in Native format (NONE mode)
25352532

25362533
// ═══════════════════════════════════════════════════════════════════
25372534
// VARIANT STRUCTURE STREAM
@@ -2548,6 +2545,7 @@ const block = new Uint8Array([
25482545
0x0a, 0x00, 0x00, 0x00, // Row 1: a = 10
25492546

25502547
// --- DYNAMIC PATH "b" - Variant discriminators (ALL ROWS) ---
2548+
// Variants sorted alphabetically: SharedVariant=0, String=1 (255=NULL)
25512549
0x01, // Row 0: discriminator = 1 (String)
25522550
0x01, // Row 1: discriminator = 1 (String)
25532551

@@ -2578,6 +2576,185 @@ const block = new Uint8Array([
25782576

25792577
4. **Shared data offsets**: One UInt64 offset per row. As with any `Array` column, the offsets are cumulative: each row's offset is the total number of shared path entries up to and including that row. With 0 shared paths, all offsets are 0.
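
A small sketch of the offset arithmetic, assuming the shared data offsets are cumulative as they are for ordinary `Array` columns in Native format (each offset is then written as a little-endian UInt64). `toCumulativeOffsets` is a hypothetical helper name.

```typescript
// Hypothetical helper: turn per-row shared-path counts into cumulative offsets.
function toCumulativeOffsets(countsPerRow: number[]): number[] {
  const offsets: number[] = [];
  let total = 0;
  for (const count of countsPerRow) {
    total += count; // running total of entries seen so far
    offsets.push(total);
  }
  return offsets;
}
```

For per-row counts `[1, 0, 2]` this gives `[1, 1, 3]`; when no row has shared paths, every offset stays 0.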
25802578

2579+
**Example: JSON exceeding `max_dynamic_paths` (shared data)**
2580+
2581+
This example shows what happens when the number of paths exceeds `max_dynamic_paths`. Overflow paths are stored in the **shared data** section.
2582+
2583+
```bash
2584+
curl -s -XPOST "http://localhost:8123?default_format=Native&allow_experimental_json_type=1" \
2585+
--data-binary "SELECT '{\"a\": 1, \"b\": 2, \"c\": 3}'::JSON(max_dynamic_paths=2) AS col" | xxd
2586+
```
2587+
2588+
With `max_dynamic_paths=2` and 3 paths in the JSON:
2589+
- Paths "a" and "b" become **dynamic paths** (stored as `Dynamic` columns)
2590+
- Path "c" overflows to **shared data**
2591+
2592+
```
2593+
00000000: 0101 0363 6f6c 194a 534f 4e28 6d61 785f ...col.JSON(max_
2594+
00000010: 6479 6e61 6d69 635f 7061 7468 733d 3229 dynamic_paths=2)
2595+
00000020: 0000 0000 0000 0000 0202 0161 0162 0100 ...........a.b..
2596+
00000030: 0000 0000 0000 0101 0549 6e74 3634 0000 .........Int64..
2597+
00000040: 0000 0000 0000 0100 0000 0000 0000 0101 ................
2598+
00000050: 0549 6e74 3634 0000 0000 0000 0000 0001 .Int64..........
2599+
00000060: 0000 0000 0000 0000 0200 0000 0000 0000 ................
2600+
00000070: 0100 0000 0000 0000 0163 090a 0300 0000 .........c......
2601+
00000080: 0000 0000 ....
2602+
```
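
To check a byte-by-byte breakdown against a dump like this one, the hex columns can be converted back into bytes. The sketch below assumes the default `xxd` layout (offset, colon, up to eight groups of two bytes, then an ASCII column). `parseXxd` is a hypothetical helper, not part of `xxd` or ClickHouse.

```typescript
// Hypothetical helper: rebuild the raw bytes from default-layout `xxd` output.
function parseXxd(dump: string): Uint8Array {
  const bytes: number[] = [];
  for (const line of dump.split("\n")) {
    const colon = line.indexOf(":");
    if (colon < 0) continue; // skip blank or non-dump lines
    const tokens = line.slice(colon + 1).trim().split(/\s+/);
    // Default xxd prints at most 8 hex groups per line; anything after that
    // is the ASCII column. On a short last line, stop at the first token
    // that is not 2 or 4 hex digits (an ASCII column that happens to look
    // like hex would be misread, which this sketch does not guard against).
    for (const token of tokens.slice(0, 8)) {
      if (!/^(?:[0-9a-fA-F]{2}){1,2}$/.test(token)) break;
      for (let i = 0; i < token.length; i += 2) {
        bytes.push(parseInt(token.slice(i, i + 2), 16));
      }
    }
  }
  return Uint8Array.from(bytes);
}
```

Applied to the dump above, this should produce 132 bytes (eight full 16-byte lines plus four bytes on the last line), which can then be compared element by element with the annotated array.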
2603+
2604+
Full breakdown:
2605+
```typescript
2606+
const block = new Uint8Array([
2607+
// ═══════════════════════════════════════════════════════════════════
2608+
// BLOCK HEADER
2609+
// ═══════════════════════════════════════════════════════════════════
2610+
0x01, // NumColumns = 1
2611+
0x01, // NumRows = 1
2612+
2613+
// ═══════════════════════════════════════════════════════════════════
2614+
// COLUMN HEADER
2615+
// ═══════════════════════════════════════════════════════════════════
2616+
0x03, 0x63, 0x6f, 0x6c, // Column name = "col"
2617+
0x19, // Type name length = 25
2618+
// "JSON(max_dynamic_paths=2)"
2619+
0x4a, 0x53, 0x4f, 0x4e, 0x28, 0x6d, 0x61, 0x78, 0x5f,
2620+
0x64, 0x79, 0x6e, 0x61, 0x6d, 0x69, 0x63, 0x5f,
2621+
0x70, 0x61, 0x74, 0x68, 0x73, 0x3d, 0x32, 0x29,
2622+
2623+
// ═══════════════════════════════════════════════════════════════════
2624+
// OBJECT STRUCTURE STREAM
2625+
// ═══════════════════════════════════════════════════════════════════
2626+
0x00, 0x00, 0x00, 0x00, // Version = 0 (V1)
2627+
0x00, 0x00, 0x00, 0x00,
2628+
2629+
0x02, // max_dynamic_paths = 2
2630+
0x02, // num_dynamic_paths = 2
2631+
0x01, 0x61, // Path 0: "a"
2632+
0x01, 0x62, // Path 1: "b"
2633+
// Note: "c" is NOT here - it's in shared data
2634+
2635+
// NOTE: No statistics in Native format (NONE mode is default)
2636+
2637+
// ═══════════════════════════════════════════════════════════════════
2638+
// DYNAMIC PATH "a" STRUCTURE
2639+
// ═══════════════════════════════════════════════════════════════════
2640+
0x01, 0x00, 0x00, 0x00, // Dynamic version = 1 (V1)
2641+
0x00, 0x00, 0x00, 0x00, // (Dynamic V1=1, V2=2, FLATTENED=3, V3=4)
2642+
0x01, // max_dynamic_types = 1
2643+
0x01, // num_dynamic_types = 1
2644+
0x05, 0x49, 0x6e, 0x74, 0x36, 0x34, // Type: "Int64"
2645+
// NOTE: No Dynamic stats (NONE mode)
2646+
0x00, 0x00, 0x00, 0x00, // Variant mode = 0 (COMPACT)
2647+
0x00, 0x00, 0x00, 0x00,
2648+
2649+
// ═══════════════════════════════════════════════════════════════════
2650+
// DYNAMIC PATH "b" STRUCTURE (similar to "a")
2651+
// ═══════════════════════════════════════════════════════════════════
2652+
0x01, 0x00, 0x00, 0x00, // Dynamic version = 1 (V1)
2653+
0x00, 0x00, 0x00, 0x00,
2654+
0x01, // max_dynamic_types = 1
2655+
0x01, // num_dynamic_types = 1
2656+
0x05, 0x49, 0x6e, 0x74, 0x36, 0x34, // Type: "Int64"
2657+
0x00, 0x00, 0x00, 0x00, // Variant mode = 0 (COMPACT)
2658+
0x00, 0x00, 0x00, 0x00,
2659+
2660+
// ═══════════════════════════════════════════════════════════════════
2661+
// DYNAMIC PATH "a" DATA (Int64 value = 1)
2662+
// ═══════════════════════════════════════════════════════════════════
2663+
// Variants sorted alphabetically: Int64=0, SharedVariant=1 (255=NULL)
2664+
0x00, // Discriminator = 0 (Int64 variant)
2665+
0x01, 0x00, 0x00, 0x00, // value = 1 (Int64 little-endian)
2666+
0x00, 0x00, 0x00, 0x00,
2667+
2668+
// ═══════════════════════════════════════════════════════════════════
2669+
// DYNAMIC PATH "b" DATA (Int64 value = 2)
2670+
// ═══════════════════════════════════════════════════════════════════
2671+
0x00, // Discriminator = 0 (Int64 variant)
2672+
0x02, 0x00, 0x00, 0x00, // value = 2 (Int64 little-endian)
2673+
0x00, 0x00, 0x00, 0x00,
2674+
2675+
// ═══════════════════════════════════════════════════════════════════
2676+
// SHARED DATA STREAM
2677+
// This is where path "c" lives (overflow from max_dynamic_paths)
2678+
// Format: Array(Tuple(path: String, value: String))
2679+
// ═══════════════════════════════════════════════════════════════════
2680+
2681+
// Array offsets (one UInt64 per row):
2682+
0x01, 0x00, 0x00, 0x00, // Row 0: offset = 1 (has 1 element)
2683+
0x00, 0x00, 0x00, 0x00,
2684+
2685+
// Shared data element 0 (path "c"):
2686+
0x01, 0x63, // Path name: "c" (len=1)
2687+
2688+
// Binary-encoded Dynamic value for "c":
2689+
0x09, // Value string length = 9 bytes
2690+
0x0a, // BinaryTypeIndex = 0x0a (Int64)
2691+
0x03, 0x00, 0x00, 0x00, // value = 3 (Int64 little-endian)
2692+
0x00, 0x00, 0x00, 0x00,
2693+
]);
2694+
```
2695+
2696+
**Key observations for shared data:**
2697+
2698+
1. **Overflow mechanism**: When paths exceed `max_dynamic_paths`, extra paths go to shared data. Here "a" and "b" are dynamic paths, "c" overflows.
2699+
2700+
2. **Shared data format**: `Array(Tuple(path: String, value: String))`
2701+
- Array offsets (one cumulative UInt64 per row) mark where each row's shared paths end
2702+
- Each element is a (path_name, binary_encoded_value) tuple
2703+
2704+
3. **Binary-encoded values in shared data**: The value is stored as a length-prefixed string containing:
2705+
- `BinaryTypeIndex` (1 byte): Type identifier (0x0a = Int64)
2706+
- Native value: The value in its type's binary format
2707+
2708+
4. **BinaryTypeIndex values** (common types):
2709+
| Type | Index |
2710+
|------|-------|
2711+
| Nothing | 0x00 |
2712+
| UInt8 | 0x01 |
2713+
| UInt32 | 0x03 |
2714+
| UInt64 | 0x04 |
2715+
| Int64 | 0x0a |
2716+
| String | 0x15 |
2717+
| Array | 0x1e |
2718+
| JSON | 0x30 |
2719+
2720+
5. **Space tradeoff**: Shared data is less efficient than dynamic paths because every value repeats its path name and carries its own type encoding. Set `max_dynamic_paths` high enough to keep frequently occurring paths out of shared data.
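
The element layout from observations 2 to 4 can be sketched as a small reader. This is an illustration under stated assumptions, not the reference decoder: the helper names are hypothetical, only `Int64` (0x0a) payloads are handled, and values outside JavaScript's safe integer range are rejected rather than decoded.

```typescript
// Hypothetical names; only the byte layout comes from the example above.
interface SharedEntry { path: string; typeIndex: number; value: number }

// Read a VarUInt (7 bits per byte, high bit = continuation).
function readVarUInt(buf: Uint8Array, pos: number): [number, number] {
  let result = 0;
  let scale = 1;
  while (true) {
    if (pos >= buf.length) throw new Error("truncated VarUInt");
    const byte = buf[pos++];
    result += (byte & 0x7f) * scale;
    if ((byte & 0x80) === 0) return [result, pos];
    scale *= 128;
  }
}

// Read one (path, binary-encoded value) element; returns the entry and the next position.
function readSharedEntry(buf: Uint8Array, start: number): [SharedEntry, number] {
  let [len, pos] = readVarUInt(buf, start);
  const path = new TextDecoder().decode(buf.subarray(pos, pos + len));
  pos += len;
  [len, pos] = readVarUInt(buf, pos); // length of the encoded value string
  const end = pos + len;
  const typeIndex = buf[pos++];
  if (typeIndex !== 0x0a) {
    throw new Error("this sketch only decodes Int64 (0x0a), got 0x" + typeIndex.toString(16));
  }
  const view = new DataView(buf.buffer, buf.byteOffset + pos, 8);
  // Combine the unsigned low half and the signed high half (little-endian).
  const value = view.getInt32(4, true) * 4294967296 + view.getUint32(0, true);
  if (!Number.isSafeInteger(value)) throw new Error("Int64 outside safe integer range");
  return [{ path, typeIndex, value }, end];
}
```

Fed the twelve bytes `01 63 09 0a 03 00 00 00 00 00 00 00` from the example, the reader returns path `"c"`, type index `0x0a`, value `3`, and a next position of 12. Skipping to `end` rather than trusting the payload length keeps the reader aligned even if a type is added later.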
2721+
2722+
**Example: Variant discriminator ordering (multiple types)**
2723+
2724+
Variant types are sorted **alphabetically by type name** to determine discriminator values. SharedVariant is included in this sorting. Here's an example showing how discriminators are assigned:
2725+
2726+
```sql
2727+
-- Insert multiple rows with different JSON value types
2728+
INSERT INTO test_json VALUES
2729+
('{"x": true}'), -- Bool
2730+
('{"x": 3.14}'), -- Float64
2731+
('{"x": "hello"}'), -- String
2732+
('{"x": [1,2,3]}'), -- Array(Nullable(Int64))
2733+
('{"x": null}'); -- NULL
2734+
```
2735+
2736+
The Dynamic column for path "x" will have variant types sorted alphabetically:
2737+
2738+
| Index | Type | Notes |
2739+
|-------|------|-------|
2740+
| 0 | Array(Nullable(Int64)) | "A" < "B" < ... |
2741+
| 1 | Bool | |
2742+
| 2 | Float64 | |
2743+
| 3 | SharedVariant | Implicit, always present |
2744+
| 4 | String | "Sh..." < "St..." |
2745+
| 255 | (NULL) | Special NULL_DISCRIMINATOR |
2746+
2747+
The discriminator bytes in the serialized data will be:
2748+
```
2749+
0x01, // Bool (index 1)
2750+
0x02, // Float64 (index 2)
2751+
0x04, // String (index 4, after SharedVariant=3)
2752+
0x00, // Array (index 0)
2753+
0xff, // NULL (255)
2754+
```
2755+
2756+
**Key rule**: For any type T in a Dynamic column, its discriminator = index in the alphabetically sorted list of [all_types + SharedVariant].
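
The key rule can be sketched in a few lines. The function names are hypothetical, and the sketch assumes that plain code-unit string ordering (the default for `Array.prototype.sort`) matches the alphabetical ordering described above, which holds for the ASCII type names in these examples.

```typescript
const NULL_DISCRIMINATOR = 255;

// Hypothetical helper: map each variant type name to its discriminator by
// sorting the names together with the implicit "SharedVariant".
function assignDiscriminators(typeNames: string[]): Map<string, number> {
  const all = Array.from(new Set(typeNames.concat("SharedVariant"))).sort();
  return new Map(all.map((name, index): [string, number] => [name, index]));
}

// Look up the discriminator byte for a value's type; null means a NULL value.
function discriminatorOf(table: Map<string, number>, typeName: string | null): number {
  if (typeName === null) return NULL_DISCRIMINATOR;
  const found = table.get(typeName);
  if (found === undefined) throw new Error("unknown variant type: " + typeName);
  return found;
}
```

For the five-row example this assigns `Array(Nullable(Int64))`=0, `Bool`=1, `Float64`=2, `SharedVariant`=3, `String`=4. For a column that has only seen `String`, it gives `SharedVariant`=0 and `String`=1, matching the earlier byte breakdowns.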
2757+
25812758
**Path Naming:**
25822759

25832760
JSON paths use dot notation:
