Commit 4cee32e

first draft extension_types
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
1 parent 1222fb3 commit 4cee32e

1 file changed: +98 -0 lines changed

text/0001-extension-types.md

Lines changed: 98 additions & 0 deletions
@@ -0,0 +1,98 @@
- Feature Name: `extension_types`
- Start Date: (2026-02-23)
- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000)
- Tracking Issue: [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)

## Summary

We would like to build a more robust system for extension data types (extension `DType`s).

TODO

## Motivation

TODO

## Design

[vortex-data/vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, here Rust unit structs with methods) for extension `DType`s. Each extension type (e.g. `Timestamp`) now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. The type-erased `ExtDTypeRef` carries this vtable inside `DType::Extension`.
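
One way to picture the type erasure described above is the minimal sketch below. It is not the Vortex implementation: `TypedVTable`, `DynExt`, `Bound`, and `ErasedExt` are hypothetical names, `DType` is a two-variant stand-in, and errors are plain `String`s. It only shows how a vtable with an associated `Metadata` type could be bundled with one metadata value and hidden behind a trait object.

```rust
use std::any::Any;
use std::fmt::Debug;
use std::sync::Arc;

// Stand-in for the storage type; the real `DType` has many more variants.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    I64,
    Utf8,
}

// Typed vtable (hypothetical): a unit struct implements this once per extension type.
trait TypedVTable: 'static + Send + Sync {
    type Metadata: 'static + Send + Sync + Debug;

    fn id(&self) -> &'static str;
    fn validate(&self, metadata: &Self::Metadata, storage: &DType) -> Result<(), String>;
}

// Object-safe view of a vtable bundled with its metadata (hypothetical).
trait DynExt: Send + Sync {
    fn id(&self) -> &'static str;
    fn validate(&self, storage: &DType) -> Result<(), String>;
    fn metadata_any(&self) -> &dyn Any;
}

// Pairs a vtable with one concrete metadata value.
struct Bound<V: TypedVTable> {
    vtable: V,
    metadata: V::Metadata,
}

impl<V: TypedVTable> DynExt for Bound<V> {
    fn id(&self) -> &'static str {
        self.vtable.id()
    }
    fn validate(&self, storage: &DType) -> Result<(), String> {
        self.vtable.validate(&self.metadata, storage)
    }
    fn metadata_any(&self) -> &dyn Any {
        &self.metadata
    }
}

// Hypothetical analogue of a type-erased extension dtype handle.
#[derive(Clone)]
struct ErasedExt(Arc<dyn DynExt>);

impl ErasedExt {
    fn new<V: TypedVTable>(vtable: V, metadata: V::Metadata) -> Self {
        ErasedExt(Arc::new(Bound { vtable, metadata }))
    }
    fn id(&self) -> &'static str {
        self.0.id()
    }
    fn validate(&self, storage: &DType) -> Result<(), String> {
        self.0.validate(storage)
    }
    // Recovers the typed metadata; returns `None` if `M` is the wrong type.
    fn metadata<M: 'static>(&self) -> Option<&M> {
        self.0.metadata_any().downcast_ref::<M>()
    }
}

// Example extension type; the id string and `TimeUnit` are made up for this sketch.
#[derive(Debug, PartialEq)]
enum TimeUnit {
    Millis,
    Micros,
}

struct Timestamp;

impl TypedVTable for Timestamp {
    type Metadata = TimeUnit;

    fn id(&self) -> &'static str {
        "example.timestamp"
    }
    fn validate(&self, _metadata: &TimeUnit, storage: &DType) -> Result<(), String> {
        if *storage == DType::I64 {
            Ok(())
        } else {
            Err(format!("timestamp storage must be I64, got {storage:?}"))
        }
    }
}
```

With these definitions, `ErasedExt::new(Timestamp, TimeUnit::Millis)` yields a handle that can be stored without knowing the concrete vtable type, while `metadata::<TimeUnit>()` recovers the typed metadata by downcasting.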

There were a few blockers (detailed in the previous tracking issue [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)), but those have now been resolved, so we can move forward with this.

Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place all extension logic (for types, scalars, and arrays) on a single `ExtVTable`. It will look something like this:

```rust
// Naming should be considered VERY unstable / not set!

pub trait ExtVTable: 'static + Send + Sync + ... {
    // Extra data that complements the extension type.
    type Metadata: ...;

    // A native Rust value that represents a scalar of the extension type.
    type Value<'a>: Display;

    // `DType`

    fn id(&self) -> ExtID;
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>;
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>;
    fn deserialize_metadata(&self, data: &[u8]) -> VortexResult<Self::Metadata>;

    // `Scalar`

    fn validate_scalar_value(&self, metadata: &Self::Metadata, storage_dtype: &DType, storage_value: &ScalarValue) -> VortexResult<()>;
    fn unpack<'a>(&self, metadata: &'a Self::Metadata, storage_dtype: &'a DType, storage_value: &'a ScalarValue) -> Self::Value<'a>;
    fn cast_scalar(&self, metadata: &Self::Metadata, scalar: &Scalar, target: &DType) -> VortexResult<Scalar> { ... }

    // `ArrayRef`

    fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>;
    fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... }
    fn other_compute_thing???(&self, ...) -> VortexResult<ArrayRef> { ... }
    // <-- Probably a lot more than this -->
}
```
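
To make the shape of the trait more concrete, here is a hedged, self-contained sketch of what implementing its `DType` and `Scalar` hooks might look like. Everything here is an assumption for illustration: `DType` and `ScalarValue` are toy enums, `ExtResult` replaces `VortexResult`, the id is a plain string instead of `ExtID`, the one-byte metadata encoding is invented, and the array hooks are left out. `unpack` panics on a storage value that has not been validated, since the proposed signature is infallible.

```rust
use std::fmt::{self, Display};

// Toy stand-ins for the Vortex types; the real ones are much richer.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    I64,
    Utf8,
}

#[derive(Debug, Clone, PartialEq)]
enum ScalarValue {
    I64(i64),
    Utf8(String),
}

type ExtResult<T> = Result<T, String>;

// A reduced version of the proposed trait: dtype and scalar hooks only.
trait ExtVTable: 'static + Send + Sync {
    type Metadata;
    type Value<'a>: Display;

    fn id(&self) -> &'static str;
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> ExtResult<()>;
    fn serialize_metadata(&self, metadata: &Self::Metadata) -> ExtResult<Vec<u8>>;
    fn deserialize_metadata(&self, data: &[u8]) -> ExtResult<Self::Metadata>;
    fn validate_scalar_value(&self, metadata: &Self::Metadata, storage_dtype: &DType, storage_value: &ScalarValue) -> ExtResult<()>;
    fn unpack<'a>(&self, metadata: &'a Self::Metadata, storage_dtype: &'a DType, storage_value: &'a ScalarValue) -> Self::Value<'a>;
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Millis,
    Micros,
}

// The native value borrows the metadata, which is why `Value` takes a lifetime.
struct TimestampValue<'a> {
    unit: &'a TimeUnit,
    ticks: i64,
}

impl Display for TimestampValue<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let suffix = match self.unit {
            TimeUnit::Millis => "ms",
            TimeUnit::Micros => "us",
        };
        write!(f, "{}{}", self.ticks, suffix)
    }
}

struct Timestamp;

impl ExtVTable for Timestamp {
    type Metadata = TimeUnit;
    type Value<'a> = TimestampValue<'a>;

    fn id(&self) -> &'static str {
        "example.timestamp"
    }

    fn validate_dtype(&self, _metadata: &TimeUnit, storage_dtype: &DType) -> ExtResult<()> {
        match storage_dtype {
            DType::I64 => Ok(()),
            other => Err(format!("timestamp storage must be I64, got {other:?}")),
        }
    }

    // Invented encoding: one byte, 0 = millis, 1 = micros.
    fn serialize_metadata(&self, metadata: &TimeUnit) -> ExtResult<Vec<u8>> {
        Ok(vec![match metadata {
            TimeUnit::Millis => 0,
            TimeUnit::Micros => 1,
        }])
    }

    fn deserialize_metadata(&self, data: &[u8]) -> ExtResult<TimeUnit> {
        match data {
            [0] => Ok(TimeUnit::Millis),
            [1] => Ok(TimeUnit::Micros),
            _ => Err(format!("invalid timestamp metadata: {data:?}")),
        }
    }

    fn validate_scalar_value(&self, metadata: &TimeUnit, storage_dtype: &DType, storage_value: &ScalarValue) -> ExtResult<()> {
        self.validate_dtype(metadata, storage_dtype)?;
        match storage_value {
            ScalarValue::I64(_) => Ok(()),
            other => Err(format!("expected an I64 value, got {other:?}")),
        }
    }

    fn unpack<'a>(&self, metadata: &'a TimeUnit, _storage_dtype: &'a DType, storage_value: &'a ScalarValue) -> TimestampValue<'a> {
        match storage_value {
            ScalarValue::I64(ticks) => TimestampValue { unit: metadata, ticks: *ticks },
            other => panic!("unpack called on unvalidated value {other:?}"),
        }
    }
}
```

In this sketch, metadata round-trips through `serialize_metadata` and `deserialize_metadata`, and callers are expected to run `validate_scalar_value` before `unpack`. Whether the real trait should make `unpack` fallible instead is a design question this sketch does not settle.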

TODO

## Compatibility

TODO

## Drawbacks

TODO

## Alternatives

TODO

## Prior Art

TODO

## Unresolved Questions

TODO

## Future Possibilities

If we can get extension types working well, then in theory we can easily add all of these types (storage types in parentheses):

- `DateTimeParts` (`Primitive`)
- Matrix (`FixedSizeList`)
- Tensor (`FixedSizeList`)
- UUID (Do we need to add `FixedSizeBinary` as a canonical type?)
- JSON (`Utf8`)
- PDX: https://arxiv.org/pdf/2503.04422v1 (`FixedSizeList`)
- Variant
  - Shredding (Lots of possibilities here!)
- Union
  - Sparse (`Struct { Primitive, Struct { types } }`)
  - Dense[^1]
- Map (`List<Struct { K, V }>`)
- Tags: https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892 (`ListView<Utf8>`)
- `Struct` but with protobuf-style field numbers (`Struct`)
- Probably lots more!
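
As a small illustration of what "extension type over a storage type" means for one entry in this list, the sketch below checks that a candidate storage type has the `List<Struct { K, V }>` shape a Map would need. It is only a sketch under stated assumptions: the recursive `DType` enum and the `map_entry_dtypes` helper are hypothetical and do not mirror the real Vortex `DType`.

```rust
// Toy recursive dtype with only the shapes this example needs.
#[derive(Debug, Clone, PartialEq)]
enum DType {
    I64,
    Utf8,
    List(Box<DType>),
    Struct(Vec<(String, DType)>),
}

// Checks the `List<Struct { K, V }>` shape and returns the key and value dtypes.
fn map_entry_dtypes(storage: &DType) -> Result<(&DType, &DType), String> {
    let DType::List(element) = storage else {
        return Err(format!("map storage must be a List, got {storage:?}"));
    };
    // `element` is a `&Box<DType>`; `&**element` borrows the inner dtype.
    let DType::Struct(fields) = &**element else {
        return Err(format!("map entries must be a Struct, got {element:?}"));
    };
    match fields.as_slice() {
        [(_, key), (_, value)] => Ok((key, value)),
        _ => Err(format!("map entries need exactly 2 fields, got {}", fields.len())),
    }
}
```

A check like this is the kind of logic that would live in a Map extension's `validate_dtype`, with the storage type doing all of the physical work.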

[^1]: `Struct` doesn't work here because children can have different lengths. Instead, we could force the inner `Struct { types }` to hold `SparseArray` fields, which would be effectively the same layout but with the overhead of tracking indices for each child field. In that case, it might be better to always use a "sparse" union and let the compressor decide what to do.
