Commit 82ac1de

More things
Signed-off-by: Adam Gutglick <adam@spiraldb.com>
1 parent 4e529b8 commit 82ac1de

File tree

1 file changed (+41 −9 lines)


proposals/0015-variant-type.md

Lines changed: 41 additions & 9 deletions
@@ -1,7 +1,9 @@

- Start Date: 2025-02-25
- RFC PR: [vortex-data/rfcs#15](https://github.com/vortex-data/rfcs/pull/15)
- Tracking Issue: [vortex-data/vortex#0000](https://github.com/vortex-data/vortex/issues/0000)

# Variant Type

## Summary

Vortex currently requires a strict schema, but real-world data is often only semi-structured. Logs, traces and user-generated data often capture generally sparse data, which requires some processing to make it useful for most analytical systems.
@@ -18,19 +20,48 @@ The variant can be commonly described as the following rust type:
```rust
enum Variant {
    Value(Scalar),
    List(Vec<Variant>),
    Object(BTreeMap<String, Variant>), // Usually sorted to allow efficient key finding
}
```

Different systems have different variations of this idea, but at its core it's a type that can hold nested data with either a flexible schema or no schema at all. In addition to this "catch all" column, most systems include the concept of "shredding": extracting a key with a specific type out of this column and storing it in a dense way. This design can make commonly accessed subfields perform like first-class columns, while keeping the overall schema flexible.
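To make shredding concrete, here is a toy sketch. It is not Vortex's actual API: the `Value(Scalar)` case is simplified to concrete `Int`/`Str` variants, and `shred_i64` is a made-up helper that pulls one key out of a variant column as a dense, nullable column.

```rust
use std::collections::BTreeMap;

// Simplified version of the Variant type above; `Value(Scalar)` is replaced
// by concrete Int/Str cases so the example is self-contained.
#[allow(dead_code)]
#[derive(Clone, Debug)]
enum Variant {
    Int(i64),
    Str(String),
    List(Vec<Variant>),
    Object(BTreeMap<String, Variant>),
}

// Toy "shredding": extract key `k` from every row as a typed, nullable column.
// Rows where the key is missing, or holds a different type, become null.
fn shred_i64(rows: &[Variant], k: &str) -> Vec<Option<i64>> {
    rows.iter()
        .map(|row| match row {
            Variant::Object(fields) => match fields.get(k) {
                Some(Variant::Int(v)) => Some(*v),
                _ => None,
            },
            _ => None,
        })
        .collect()
}

fn main() {
    let mut a = BTreeMap::new();
    a.insert("status".to_string(), Variant::Int(200));
    let mut b = BTreeMap::new();
    b.insert("status".to_string(), Variant::Str("timeout".to_string()));
    let rows = vec![Variant::Object(a), Variant::Object(b)];
    // The string-typed row shreds to null in the i64 column.
    assert_eq!(shred_i64(&rows, "status"), vec![Some(200), None]);
}
```

The shredded column behaves like a first-class `i64` column, while rows that don't fit stay reachable through the un-shredded variant data.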

I propose a new dtype - `DType::Variant`. The variant type is always nullable, and its canonical encoding is just an array with a single child array, which is encoded in some specialized variant type.
2830

29-
In addition to a new canonical encoding, we'll need a few more pieces to make variant columns useful:
31+
### Nullability
32+
33+
In order to support data with a changing or unexpected schema, Variant arrays are always nullable, even for a specific key/path, its value might change type between items which will cause null values in shredded children.
34+
35+
Combined with shredding, handling nulls can be complex and is encoding dependent (Like this [parquet example](https://github.com/apache/parquet-format/blob/master/VariantShredding.md#arrays) for handling arrays).

### Expressions

Variant columns are commonly accessed through a combination of column, path and the desired type, all of which are required to extract a column with a known type. Our current `GetItem` has two issues:

1. It assumes the input can be executed into a struct array.
2. Access is only based on name.

I suggest we add a new expression - `get_variant_element(path, dtype)` (name TBD) - which will support flexible paths and allow extracting children from variants.

Every variant encoding will need to be able to dispatch these behaviors, returning arrays of the expected type.
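A sketch of what such a path lookup could do under the hood. Everything here is hypothetical (the `PathSegment` enum and `get_element` helper are not from the proposal, and dtype checking is left out); the point is that a path mixes object field names and list indices, and that any miss resolves to null rather than an error.

```rust
use std::collections::BTreeMap;

#[allow(dead_code)]
#[derive(Clone, Debug)]
enum Variant {
    Int(i64),
    Str(String),
    List(Vec<Variant>),
    Object(BTreeMap<String, Variant>),
}

// A path mixes object field names and list indices (`&'static str` keeps the
// sketch simple).
enum PathSegment {
    Field(&'static str),
    Index(usize),
}

// Walk the path; a missing key, wrong kind of node, or out-of-range index
// yields None, matching the rule that variant access is always nullable.
fn get_element<'v>(v: &'v Variant, path: &[PathSegment]) -> Option<&'v Variant> {
    path.iter().try_fold(v, |cur, seg| match (cur, seg) {
        (Variant::Object(m), PathSegment::Field(name)) => m.get(*name),
        (Variant::List(items), PathSegment::Index(i)) => items.get(*i),
        _ => None,
    })
}

fn main() {
    let mut obj = BTreeMap::new();
    obj.insert(
        "tags".to_string(),
        Variant::List(vec![Variant::Str("a".into()), Variant::Str("b".into())]),
    );
    let row = Variant::Object(obj);
    // $.tags[1] resolves to the string "b".
    match get_element(&row, &[PathSegment::Field("tags"), PathSegment::Index(1)]) {
        Some(Variant::Str(s)) => assert_eq!(s, "b"),
        _ => panic!("expected a string"),
    }
    // A bad path is a null, not an error.
    assert!(get_element(&row, &[PathSegment::Field("missing")]).is_none());
}
```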

### Arrow representation

Arrow now has a new [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to represent Parquet's variant type. I think supporting this encoding will be a good start, but it requires supporting Arrow extension types.

Supporting extension types requires replacing the target `DataType` and nullability with a `Field`, which also includes metadata like a desired extension type. I believe this change is desirable, as Vortex DTypes include more information than a plain `arrow::DataType`.

### Scalar

While there has long been talk of converting the Vortex scalar system from an enum to length-1 arrays, I believe the current system actually works very well for variants, and the Variant scalar can just be some version of the type described above.

Just like when extracting child arrays, Variants need to support an additional expression, `get_variant_scalar(idx, path, dtype)`, that will indicate the desired dtype.
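In the same spirit, a hypothetical shape for `get_variant_scalar(idx, path, dtype)`. The name comes from the proposal, but the signature and the `get_variant_scalar_i64` helper are my sketch, with the desired dtype hard-coded to `i64`: select a row, follow the path through nested objects, and only return a value of the requested type.

```rust
use std::collections::BTreeMap;

#[allow(dead_code)]
#[derive(Clone, Debug)]
enum Variant {
    Int(i64),
    Str(String),
    List(Vec<Variant>),
    Object(BTreeMap<String, Variant>),
}

// Hypothetical sketch of `get_variant_scalar(idx, path, dtype)`: pick row
// `idx` from the column, follow `path`, and return the value only if it has
// the caller's desired type (hard-coded to i64 here).
fn get_variant_scalar_i64(col: &[Variant], idx: usize, path: &[&str]) -> Option<i64> {
    let mut cur = col.get(idx)?;
    for key in path {
        match cur {
            Variant::Object(m) => cur = m.get(*key)?,
            _ => return None,
        }
    }
    match cur {
        Variant::Int(v) => Some(*v),
        _ => None,
    }
}

fn main() {
    let mut inner = BTreeMap::new();
    inner.insert("b".to_string(), Variant::Int(7));
    let mut outer = BTreeMap::new();
    outer.insert("a".to_string(), Variant::Object(inner));
    let col = vec![Variant::Str("not an object".to_string()), Variant::Object(outer)];
    assert_eq!(get_variant_scalar_i64(&col, 1, &["a", "b"]), Some(7));
    // A row of the wrong shape comes back as null, not an error.
    assert_eq!(get_variant_scalar_i64(&col, 0, &["a"]), None);
}
```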

### Constructing and writing scalars

The API for creating variant arrays is complex, as shredding decisions need to be made either beforehand based on data-specific knowledge, or on the fly during writes.

In the medium/long term, I believe the compressor should support a JSON extension type, which will take a JSON-formatted UTF8 column and parse it gradually into a binary-formatted and typed variant encoding.

## Prior Art

@@ -54,15 +85,16 @@ Statistics are only stored for the shredded columns, at the file/row group or pa

#### In-Memory

When loaded into memory, Arrow has defined a [canonical extension type](https://arrow.apache.org/docs/format/CanonicalExtensions.html#parquet-variant) to support Parquet's variant type. It's stored as a struct array, which contains a mandatory `metadata` binary child, an optional binary `value` child, and an optional `typed_value` child, which can be a "variant primitive", list or struct, allowing for nested shredding.

### Clickhouse

As described in [this](https://clickhouse.com/blog/a-new-powerful-json-data-type-for-clickhouse#building-block-2---dynamic-type) fantastic blogpost, Clickhouse offers multiple features that build on top of each other to support similar data:

1. [Variant](https://clickhouse.com/docs/sql-reference/data-types/variant) - Allows for arbitrary nesting of types. A variant can contain ints, strings and arrays of ints, strings or another variant type (note the lack of an "object" variant). Each leaf column (`col_x.str` vs `col_x.int32`) is stored separately, with some additional metadata that indicates which one is used by each row. Types have to be declared in advance.
2. [Dynamic](https://clickhouse.com/docs/sql-reference/data-types/dynamic) - Like variant, but types don't have to be declared in advance. Shreds a limited number of columns.
3. [JSON](https://clickhouse.com/docs/sql-reference/data-types/newjson) - Builds on top of `Dynamic`, with a few specialized features - allowing users to specify known "typed paths", how many dynamic paths and types to support for untyped paths, and some JSON-specific configuration allowing skipping specific JSON paths on insert.

The full blogpost is worth reading, but Clickhouse's on-disk model roughly mirrors the Arrow in-memory format, and they store some metadata outside of the array.

### Others
