Commit 21c6e6e

first draft extension_types
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
1 parent c3801e5 commit 21c6e6e

1 file changed

+97
-0
lines changed

proposals/0005-extension-types.md

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
- Start Date: (2026-02-27)
- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000)
- Tracking Issue: [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)

## Summary

We would like to build a more robust system for extension data types (extension `DType`s).

TODO

## Motivation

TODO

## Design

[vortex-data/vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, implemented as Rust unit structs with methods) for extension `DType`s. Each extension type (e.g. `Timestamp`) now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`.

There were a few blockers (detailed in the previous tracking issue [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), but those have been resolved, so we can move forward with this design.

Since `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place all extension logic (for types, scalars, and arrays) on a single `ExtVTable`. It will look something like this:

```rust
// Naming should be considered VERY unstable / not set!

pub trait ExtVTable: 'static + Send + Sync + ... {
    // Extra data that complements the extension type.
    type Metadata: ...;

    // A native Rust value that represents a scalar of the extension type.
    type Value<'a>: Display;

    // `DType`

    fn id(&self) -> ExtID;
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>;
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;
    fn deserialize_metadata(&self, data: &[u8]) -> VortexResult<Self::Metadata>;

    // `Scalar`

    fn validate_scalar_value(&self, metadata: &Self::Metadata, storage_dtype: &DType, storage_value: &ScalarValue) -> VortexResult<()>;
    fn unpack<'a>(&self, metadata: &'a Self::Metadata, storage_dtype: &'a DType, storage_value: &'a ScalarValue) -> Self::Value<'a>;
    fn cast_scalar(&self, metadata: &Self::Metadata, scalar: &Scalar, target: &DType) -> VortexResult<Scalar> { ... }

    // `ArrayRef`

    fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>;
    fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... }
    fn other_compute_thing???(&self, ...) -> VortexResult<ArrayRef> { ... }
    // <-- Probably a lot more than this -->
}
```

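To make the shape of the trait more concrete, here is a minimal, self-contained sketch of what an implementation for a `Timestamp`-like type might look like. Everything in it is an assumption made for illustration: `DType`, `ScalarValue`, and `Result` are tiny stand-ins rather than the real Vortex types, the trait is cut down to the `DType` and `Scalar` parts, `Value` drops its lifetime parameter, `unpack` returns a `Result` instead of relying on a separate validation step, and `TimeUnit`, `TimestampValue`, and the `example.timestamp` ID are hypothetical names.

```rust
use std::fmt;

// Stand-in for `VortexResult`; the real error type is not modeled here.
type Result<T> = std::result::Result<T, String>;

// Hypothetical, heavily reduced stand-ins for Vortex's `DType` and `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
pub enum DType {
    I64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    I64(i64),
    Utf8(String),
}

// Reduced version of the proposed trait (assumption: `DType` and `Scalar` parts only).
pub trait ExtVTable {
    type Metadata;
    type Value: fmt::Display;

    fn id(&self) -> &'static str;
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> Result<()>;
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> Result<Vec<u8>>;
    fn deserialize_metadata(&self, data: &[u8]) -> Result<Self::Metadata>;
    fn unpack(&self, metadata: &Self::Metadata, storage_value: &ScalarValue) -> Result<Self::Value>;
}

// Hypothetical metadata: the unit the stored integer is counted in.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TimeUnit {
    Seconds,
    Milliseconds,
}

// The vtable itself is a unit struct; all state lives in the metadata.
pub struct Timestamp;

// Native Rust value for a timestamp scalar.
pub struct TimestampValue {
    pub ticks: i64,
    pub unit: TimeUnit,
}

impl fmt::Display for TimestampValue {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let suffix = match self.unit {
            TimeUnit::Seconds => "s",
            TimeUnit::Milliseconds => "ms",
        };
        write!(f, "{}{}", self.ticks, suffix)
    }
}

impl ExtVTable for Timestamp {
    type Metadata = TimeUnit;
    type Value = TimestampValue;

    fn id(&self) -> &'static str {
        "example.timestamp" // hypothetical ID, not a real Vortex extension ID
    }

    // Only 64-bit integer storage is accepted in this sketch.
    fn validate_dtype(&self, _metadata: &TimeUnit, storage_dtype: &DType) -> Result<()> {
        match storage_dtype {
            DType::I64 => Ok(()),
            other => Err(format!("timestamp storage must be I64, got {other:?}")),
        }
    }

    // The metadata is encoded as a single tag byte.
    fn serialize_metadata(&self, metadata: &TimeUnit) -> Result<Vec<u8>> {
        Ok(vec![match metadata {
            TimeUnit::Seconds => 0,
            TimeUnit::Milliseconds => 1,
        }])
    }

    fn deserialize_metadata(&self, data: &[u8]) -> Result<TimeUnit> {
        match data {
            [0] => Ok(TimeUnit::Seconds),
            [1] => Ok(TimeUnit::Milliseconds),
            _ => Err(format!("invalid timestamp metadata: {data:?}")),
        }
    }

    // Combines the storage value with the metadata into a native value.
    fn unpack(&self, metadata: &TimeUnit, storage_value: &ScalarValue) -> Result<TimestampValue> {
        match storage_value {
            ScalarValue::I64(ticks) => Ok(TimestampValue { ticks: *ticks, unit: *metadata }),
            other => Err(format!("expected I64 storage value, got {other:?}")),
        }
    }
}
```

The point of the sketch is only that the vtable stays stateless: the same `Timestamp` unit struct serves every unit, because the unit is passed in as `Metadata` on each call.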
TODO

## Compatibility

TODO

## Drawbacks

TODO

## Alternatives

TODO

## Prior Art

TODO

## Unresolved Questions

TODO

## Future Possibilities

If extension types work well, we should in theory be able to add all of the following types easily (storage types in parentheses):

- `DateTimeParts` (`Primitive`)
- Matrix (`FixedSizeList`)
- Tensor (`FixedSizeList`)
- UUID (Do we need to add `FixedSizeBinary` as a canonical type?)
- JSON (`Utf8`)
- [PDX](https://arxiv.org/pdf/2503.04422v1) (`FixedSizeList`)
- Variant
  - Shredding (Lots of possibilities here!)
- Union
  - Sparse (`Struct { Primitive, Struct { types } }`)
  - Dense[^1]
- Map (`List<Struct { K, V }>`)
- [Tags](https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892) (`ListView<Utf8>`)
- `Struct` but with protobuf-style field numbers (`Struct`)
- Probably lots more!

[^1]: `Struct` does not work here because the children can have different lengths. We could instead force the inner `Struct { types }` to hold `SparseArray` fields, which would be effectively the same thing but with the overhead of tracking indices for each child field. In that case, it might be better to always use a "sparse" union and let the compressor decide what to do.
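
The length mismatch behind the sparse/dense distinction can be illustrated with a small sketch. It follows the general sparse and dense union layouts used by Apache Arrow, not any existing Vortex code; `Value`, `SparseUnion`, `DenseUnion`, `to_sparse`, and `to_dense` are hypothetical names, and plain `Vec`s stand in for child arrays.

```rust
// Hypothetical logical values of a two-variant union.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Int(i64),
    Str(String),
}

// Sparse layout: every child has the same length as the union itself,
// so the children fit in a `Struct`. Unselected slots hold `None`.
#[derive(Debug, PartialEq)]
pub struct SparseUnion {
    pub type_ids: Vec<u8>,
    pub ints: Vec<Option<i64>>,
    pub strs: Vec<Option<String>>,
}

// Dense layout: each child holds only its own values, so child lengths differ.
// `offsets[i]` is the position of row `i` inside the child chosen by `type_ids[i]`.
#[derive(Debug, PartialEq)]
pub struct DenseUnion {
    pub type_ids: Vec<u8>,
    pub offsets: Vec<u32>,
    pub ints: Vec<i64>,
    pub strs: Vec<String>,
}

pub fn to_sparse(values: &[Value]) -> SparseUnion {
    let mut out = SparseUnion { type_ids: vec![], ints: vec![], strs: vec![] };
    for v in values {
        match v {
            Value::Int(i) => {
                out.type_ids.push(0);
                out.ints.push(Some(*i));
                out.strs.push(None);
            }
            Value::Str(s) => {
                out.type_ids.push(1);
                out.ints.push(None);
                out.strs.push(Some(s.clone()));
            }
        }
    }
    out
}

pub fn to_dense(values: &[Value]) -> DenseUnion {
    let mut out = DenseUnion { type_ids: vec![], offsets: vec![], ints: vec![], strs: vec![] };
    for v in values {
        match v {
            Value::Int(i) => {
                out.type_ids.push(0);
                out.offsets.push(out.ints.len() as u32);
                out.ints.push(*i);
            }
            Value::Str(s) => {
                out.type_ids.push(1);
                out.offsets.push(out.strs.len() as u32);
                out.strs.push(s.clone());
            }
        }
    }
    out
}
```

For the rows `Int(1), Str("a"), Int(2)`, the sparse form keeps both children at length 3, while the dense form ends up with an integer child of length 2 and a string child of length 1, which is why a plain `Struct` cannot hold the dense children directly.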
