@@ -2178,7 +2178,6 @@ When `Dynamic` has more distinct types than `max_types`:
- Additional types are stored in binary form in a "SharedVariant" type (always the last variant)
- Each value in SharedVariant is: `encodeDataType(T) + serializeBinary(value, T)`
- On deserialization, the type is decoded from the binary prefix and the value is deserialized

### 6.9 JSON

The `JSON` type stores semi-structured JSON data with automatic schema inference. It uses a column-oriented storage model in which each JSON path is stored as a separate column, similar to how `Nested` columns work.
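
As a rough illustration of the path-per-column idea, the sketch below flattens one JSON object into its leaf paths. The `flattenPaths` helper, the dotted path notation, and the choice to treat arrays as leaf values are assumptions made for this example, not something the format description above specifies.

```typescript
type JsonValue = null | boolean | number | string | JsonValue[] | { [key: string]: JsonValue };

// Hypothetical helper: collect every leaf of a JSON object under a dotted path.
// Assumption: arrays are kept as leaf values rather than expanded.
function flattenPaths(obj: { [key: string]: JsonValue }, prefix: string = ""): Record<string, JsonValue> {
  const out: Record<string, JsonValue> = {};
  for (const key of Object.keys(obj)) {
    const value = obj[key];
    const path = prefix === "" ? key : prefix + "." + key;
    if (value !== null && typeof value === "object" && !Array.isArray(value)) {
      // Nested object: recurse and merge the nested paths into the result.
      const nested = flattenPaths(value as { [key: string]: JsonValue }, path);
      for (const nestedPath of Object.keys(nested)) {
        out[nestedPath] = nested[nestedPath];
      }
    } else {
      // Scalar, null, or array: this path would become one column.
      out[path] = value;
    }
  }
  return out;
}

const columns = flattenPaths({ a: 1, b: { c: "x", d: [1, 2] } });
// columns has the keys "a", "b.c", and "b.d"
```

Each key of the result corresponds to one candidate column; how those columns are then typed and serialized is covered by the stream layout described in this section.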
@@ -2225,7 +2224,9 @@ Similar to `Dynamic`, `JSON` uses multiple streams:
**V3 (version = 4):**
- Like V2 with binary type encoding and optional statistics flag

> **Note:** Version numbers are not sequential. V1=0, V2=2, V3=4. There is no version 3.
0x03, 0x00, 0x00, 0x00, // value = 3 (Int64 little-endian)
0x00, 0x00, 0x00, 0x00,
]);
```

**Key observations for shared data:**

1. **Overflow mechanism**: When the number of paths exceeds `max_dynamic_paths`, the extra paths go to shared data. Here "a" and "b" are dynamic paths, and "c" overflows.

2. **Shared data format**: `Array(Tuple(path: String, value: String))`
   - Array offsets indicate how many shared paths each row has
   - Each element is a (path_name, binary_encoded_value) tuple

3. **Binary-encoded values in shared data**: The value is stored as a length-prefixed string containing:
   - `BinaryTypeIndex` (1 byte): Type identifier (0x0a = Int64)
   - Native value: The value in its type's binary format

4. **BinaryTypeIndex values** (common types):

   | Type | Index |
   |------|-------|
   | Nothing | 0x00 |
   | UInt8 | 0x01 |
   | UInt32 | 0x03 |
   | UInt64 | 0x04 |
   | Int64 | 0x0a |
   | String | 0x15 |
   | Array | 0x1e |
   | JSON | 0x30 |

5. **Space tradeoff**: Shared data is less efficient than dynamic paths because each value carries its path name and type encoding. Set `max_dynamic_paths` high enough that the paths you rely on stay dynamic.
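
The layout in observation 3 can be sketched for a single Int64 value as follows. The helper names are hypothetical, and the sketch assumes the length prefix fits in one byte (a payload shorter than 128 bytes); the description above only says the string is length-prefixed, so treat the prefix encoding as an assumption. Values are limited to JavaScript safe integers to keep the example short.

```typescript
// Type identifier for Int64, taken from the BinaryTypeIndex table above.
const INT64_TYPE_INDEX = 0x0a;
const TWO_POW_32 = 4294967296;

// Hypothetical helper: build the shared-data value string for one Int64.
// Assumption: a one-byte length prefix is enough for this 9-byte payload.
function encodeSharedInt64(value: number): Uint8Array {
  if (!Number.isSafeInteger(value)) {
    throw new RangeError("this sketch only handles safe integers");
  }
  const payload = new Uint8Array(1 + 8);
  const view = new DataView(payload.buffer);
  payload[0] = INT64_TYPE_INDEX;
  view.setUint32(1, value >>> 0, true);                   // low 32 bits, little-endian
  view.setInt32(5, Math.floor(value / TWO_POW_32), true); // high 32 bits, carries the sign
  const out = new Uint8Array(1 + payload.length);
  out[0] = payload.length;                                // length prefix (9)
  out.set(payload, 1);
  return out;
}

// Hypothetical helper: read the type index, then the native value.
function decodeSharedInt64(bytes: Uint8Array): number {
  if (bytes.length !== 10 || bytes[0] !== 9) {
    throw new Error("unexpected length for an Int64 shared value");
  }
  if (bytes[1] !== INT64_TYPE_INDEX) {
    throw new Error("value is not tagged as Int64");
  }
  const view = new DataView(bytes.buffer, bytes.byteOffset, bytes.byteLength);
  return view.getInt32(6, true) * TWO_POW_32 + view.getUint32(2, true);
}

const encoded = encodeSharedInt64(3);
// encoded is 0x09, 0x0a, 0x03, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00
```

A general decoder would dispatch on the type index instead of rejecting everything except Int64; the single-type version keeps the byte offsets easy to follow.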
Variant types are sorted **alphabetically by type name** to determine discriminator values. SharedVariant is included in this sorting. Here's an example showing how discriminators are assigned:

```sql
-- Insert multiple rows with different JSON value types
INSERT INTO test_json VALUES
  ('{"x": true}'),     -- Bool
  ('{"x": 3.14}'),     -- Float64
  ('{"x": "hello"}'),  -- String
  ('{"x": [1,2,3]}'),  -- Array(Nullable(Int64))
  ('{"x": null}')      -- NULL
```

The Dynamic column for path "x" will have variant types sorted alphabetically:

| Index | Type | Notes |
|-------|------|-------|
| 0 | Array(Nullable(Int64)) | "A" < "B" < ... |
| 1 | Bool | |
| 2 | Float64 | |
| 3 | SharedVariant | Implicit, always present |
| 4 | String | "Sh..." < "St..." |
| 255 | (NULL) | Special NULL_DISCRIMINATOR |

The discriminator bytes in the serialized data will be:
```
0x01, // Bool (index 1)
0x02, // Float64 (index 2)
0x04, // String (index 4, after SharedVariant=3)
0x00, // Array (index 0)
0xff, // NULL (255)
```

**Key rule**: For any type T in a Dynamic column, its discriminator = index in the alphabetically sorted list of [all_types + SharedVariant].
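
The key rule can be sketched in a few lines. `assignDiscriminators` is a hypothetical helper, and the sketch assumes that "alphabetically" means plain string comparison, which coincides with byte order for ASCII type names.

```typescript
const SHARED_VARIANT = "SharedVariant";
const NULL_DISCRIMINATOR = 0xff;

// Hypothetical helper: map each variant type name to its discriminator.
// Assumption: default string ordering matches the sort used by the format.
function assignDiscriminators(typeNames: string[]): Record<string, number> {
  const sorted = typeNames.concat([SHARED_VARIANT]).sort();
  const result: Record<string, number> = {};
  sorted.forEach((name, index) => {
    result[name] = index;
  });
  return result;
}

const discriminators = assignDiscriminators([
  "Bool",
  "Float64",
  "String",
  "Array(Nullable(Int64))",
]);

// One discriminator per row, in insertion order; null marks a NULL row.
const rowTypes: (string | null)[] = ["Bool", "Float64", "String", "Array(Nullable(Int64))", null];
const discriminatorBytes = rowTypes.map((t) =>
  t === null ? NULL_DISCRIMINATOR : discriminators[t]
);
// discriminatorBytes is [1, 2, 4, 0, 255]
```

Because SharedVariant takes part in the sort, adding or removing a type whose name sorts before it shifts its index, so a reader should recompute the mapping from the type list rather than hard-code it.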