| 1 | +- Feature Name: `extension_types` |
| 2 | +- Start Date: (2026-02-23) |
| 3 | +- RFC PR: [vortex-data/rfcs#0000](https://github.com/vortex-data/rfcs/pull/0000) |
| 4 | +- Tracking Issue: [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547) |
| 5 | + |
| 6 | +## Summary |
| 7 | + |
| 8 | +This RFC proposes a more robust system for extension data types (extension `DType`s). |
| 9 | + |
| 10 | +TODO |
| 11 | + |
| 12 | +## Motivation |
| 13 | + |
| 14 | +TODO |
| 15 | + |
| 16 | +## Design |
| 17 | + |
| 18 | +[vortex-data/vortex#6081](https://github.com/vortex-data/vortex/pull/6081) introduced vtables (virtual tables, or Rust unit structs with methods) for extension `DType`s. Each extension type (e.g. `Timestamp`) now implements `ExtDTypeVTable`, which handles validation, serialization, and metadata. The type-erased `ExtDTypeRef` carries this vtable with it inside `DType::Extension`. |
| 19 | + |
| 20 | +A few blockers (detailed in the previous tracking issue [vortex-data/vortex#6547](https://github.com/vortex-data/vortex/issues/6547)) have since been resolved, so we can move forward with this design. |
| 21 | + |
| 22 | +Now that `vortex-scalar` and `vortex-dtype` have been merged into `vortex-array`, we can place all extension logic (for types, scalars, and arrays) on a single `ExtVTable` trait. It will look something like this: |
| 23 | + |
| 24 | +```rust |
| 25 | +// Naming should be considered VERY unstable / not set! |
| 26 | + |
| 27 | +pub trait ExtVTable: 'static + Send + Sync + ... { |
| 28 | + // Extra data that complements the extension type. |
| 29 | + type Metadata: ...; |
| 30 | + |
| 31 | + // A native Rust value that represents a scalar of the extension type. |
| 32 | + type Value<'a>: Display; |
| 33 | + |
| 34 | + // `DType` |
| 35 | + |
| 36 | + fn id(&self) -> ExtID; |
| 37 | + fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType) -> VortexResult<()>; |
| 38 | + fn serialize_metadata(&self, metadata: &Self::Metadata) -> VortexResult<Vec<u8>>; |
| 39 | + fn deserialize_metadata(&self, data: &[u8]) -> VortexResult<Self::Metadata>; |
| 40 | + |
| 41 | + // `Scalar` |
| 42 | + |
| 43 | + fn validate_scalar_value(&self, metadata: &Self::Metadata, storage_dtype: &DType, storage_value: &ScalarValue) -> VortexResult<()>; |
| 44 | + fn unpack<'a>(&self, metadata: &'a Self::Metadata, storage_dtype: &'a DType, storage_value: &'a ScalarValue) -> Self::Value<'a>; |
| 45 | + fn cast_scalar(&self, metadata: &Self::Metadata, scalar: &Scalar, target: &DType) -> VortexResult<Scalar> { ... } |
| 46 | + |
| 47 | + // `ArrayRef` |
| 48 | + |
| 49 | + fn validate_array(&self, metadata: &Self::Metadata, storage_array: &ArrayRef) -> VortexResult<()>; |
| 50 | + fn cast_array(&self, metadata: &Self::Metadata, array: &ArrayRef, target: &DType) -> VortexResult<ArrayRef> { ... } |
| 51 | + // Placeholder for further compute hooks (name and signature undecided): |
| 51 | + // fn other_compute_thing(&self, ...) -> VortexResult<ArrayRef> { ... } |
| 52 | + // <-- Probably a lot more than this --> |
| 53 | +} |
| 54 | +``` |
| 55 | + |
| 56 | +TODO |
| 57 | + |
| 58 | +## Compatibility |
| 59 | + |
| 60 | +TODO |
| 61 | + |
| 62 | +## Drawbacks |
| 63 | + |
| 64 | +TODO |
| 65 | + |
| 66 | +## Alternatives |
| 67 | + |
| 68 | +TODO |
| 69 | + |
| 70 | +## Prior Art |
| 71 | + |
| 72 | +TODO |
| 73 | + |
| 74 | +## Unresolved Questions |
| 75 | + |
| 76 | +TODO |
| 77 | + |
| 78 | +## Future Possibilities |
| 79 | + |
| 80 | +If we can get extension types working well, then in principle we could add all of the following types (candidate storage types in parentheses): |
| 81 | + |
| 82 | +- `DateTimeParts` (`Primitive`) |
| 83 | +- Matrix (`FixedSizeList`) |
| 84 | +- Tensor (`FixedSizeList`) |
| 85 | +- UUID (Do we need to add `FixedSizeBinary` as a canonical type?) |
| 86 | +- JSON (`Utf8`) |
| 87 | +- PDX: https://arxiv.org/pdf/2503.04422v1 (`FixedSizeList`) |
| 88 | +- Variant |
| 89 | + - Shredding (Lots of possibilities here!) |
| 90 | +- Union |
| 91 | + - Sparse (`Struct { Primitive, Struct { types } }`) |
| 92 | + - Dense[^1] |
| 93 | +- Map (`List<Struct { K, V }>`) |
| 94 | +- Tags: https://github.com/vortex-data/vortex/discussions/5772#discussioncomment-15279892 (`ListView<Utf8>`) |
| 95 | +- `Struct` but with protobuf-style field numbers (`Struct`) |
| 96 | +- Probably lots more! |
| 97 | + |
| 98 | +[^1]: `Struct` doesn't work here because children can have different lengths, but what we could do is simply force the inner `Struct { types }` to hold `SparseArray` fields, which would effectively be the exact same but with the overhead of tracking indices for each of the child fields. In that case, it might just be better to always use a "sparse" union and let the compressor decide what to do. |
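
As a rough illustration of how one of the types listed above might plug into the `ExtVTable` from the Design section, here is a minimal, self-contained sketch of a UUID extension type. Everything in it is an assumption made for illustration: `DType` and `ScalarValue` are toy stand-ins rather than the real Vortex types, `ExtVTableSketch` is a cut-down trait (owned `Value` without a lifetime parameter, `String` errors instead of `VortexResult`, a string ID instead of `ExtID`), and `FixedSizeList<u8, 16>` is just one possible storage choice while the `FixedSizeBinary` question remains open.

```rust
use std::fmt::{self, Display};

/// Toy stand-in for Vortex's `DType`; only the variants this sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub enum DType {
    U8,
    Utf8,
    FixedSizeList(Box<DType>, usize),
}

/// Toy stand-in for Vortex's `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
pub enum ScalarValue {
    Bytes(Vec<u8>),
    Utf8(String),
}

/// Hypothetical, cut-down version of the proposed trait:
/// one `DType` hook and one `Scalar` hook.
pub trait ExtVTableSketch: 'static + Send + Sync {
    type Metadata;
    type Value: Display;

    fn id(&self) -> &'static str;
    fn validate_dtype(&self, metadata: &Self::Metadata, storage_dtype: &DType)
        -> Result<(), String>;
    fn unpack(&self, metadata: &Self::Metadata, storage_value: &ScalarValue)
        -> Result<Self::Value, String>;
}

/// The vtable itself is a unit struct, as described in the Design section.
pub struct Uuid;

/// Native Rust value for a UUID scalar.
pub struct UuidValue([u8; 16]);

impl Display for UuidValue {
    // Formats as 8-4-4-4-12 lowercase hex digits.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        for (i, byte) in self.0.iter().enumerate() {
            if matches!(i, 4 | 6 | 8 | 10) {
                write!(f, "-")?;
            }
            write!(f, "{byte:02x}")?;
        }
        Ok(())
    }
}

impl ExtVTableSketch for Uuid {
    type Metadata = (); // a UUID needs no extra metadata
    type Value = UuidValue;

    fn id(&self) -> &'static str {
        "sketch.uuid" // made-up identifier
    }

    fn validate_dtype(&self, _metadata: &(), storage_dtype: &DType) -> Result<(), String> {
        // Accept only 16 one-byte elements as storage.
        match storage_dtype {
            DType::FixedSizeList(elem, 16) if **elem == DType::U8 => Ok(()),
            other => Err(format!("uuid storage must be FixedSizeList<u8, 16>, got {other:?}")),
        }
    }

    fn unpack(&self, _metadata: &(), storage_value: &ScalarValue) -> Result<UuidValue, String> {
        match storage_value {
            ScalarValue::Bytes(bytes) => <[u8; 16]>::try_from(bytes.as_slice())
                .map(UuidValue)
                .map_err(|_| format!("expected 16 bytes, got {}", bytes.len())),
            other => Err(format!("expected bytes, got {other:?}")),
        }
    }
}
```

With this sketch, `Uuid.validate_dtype(&(), &DType::FixedSizeList(Box::new(DType::U8), 16))` succeeds, while a `Utf8` storage type is rejected, and `unpack` yields a value whose `Display` output is the usual hyphenated hex form. The real trait would additionally cover metadata serialization, scalar casting, and the array-level hooks.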