feat: vortex row crate #8056

Draft
joseph-isaacs wants to merge 10 commits into develop from ji/row-pr1-base

@joseph-isaacs

Add an empty vortex-row crate with a minimal initialize stub so the
following commits can layer in the row-encoder, codec, scalar functions,
and per-encoding kernels without touching the workspace skeleton each
time. The crate is wired into the workspace members list and workspace
dependency table; public-api.lock is generated against the stub.

Signed-off-by: Claude <noreply@anthropic.com>

Summary

Closes: #000

Testing

claude added 8 commits May 17, 2026 22:00
Add an empty `vortex-row` crate with a minimal `initialize` stub so the
following commits can layer in the row-encoder, codec, scalar functions,
and per-encoding kernels without touching the workspace skeleton each
time. The crate is wired into the workspace members list and workspace
dependency table; `public-api.lock` is generated against the stub.

Signed-off-by: Claude <noreply@anthropic.com>
Introduce the per-column sort-field options and the variadic-function
options struct used by the upcoming RowSize / RowEncode scalar functions.

`RowEncodeOptions::fields` uses a `SmallVec<[SortField; 4]>` so typical
1-4 column keys avoid a heap allocation. Includes a compact serialize /
deserialize helper used later by the scalar-function metadata round-trip.

Signed-off-by: Claude <noreply@anthropic.com>
Add the byte-encoding kernels for the fixed-width portion of the row
encoder: Null, Bool, Primitive (12 PTypes), and Decimal (i8..i128). Each
encoder writes a 1-byte sentinel followed by the value's row-comparable
bytes (sign-flipped big-endian for signed ints, sign-aware mask for
floats, etc.).

The size pass is a constant `width-per-row` add for these types; the
encode pass walks rows and writes into the shared output buffer at
`offsets[i] + cursors[i]`. `row_width_for_dtype` classifies the column
based purely on its DType.

Scalar-level encoders (`encode_scalar_primitive` / `encode_scalar_bool`
/ `encode_scalar_null` / `encode_scalar` / `encoded_size_for_scalar`)
are included for the same fixed-width subset; varlen and nested
canonical variants bail with a clear "not yet supported" error and
land in follow-up commits.

The implementation is deliberately the simplest correct version:
bounds-checked array indexing, no `copy_nonoverlapping`, no validity
fast-path helper. Subsequent PRs evolve this toward the optimized form.

Signed-off-by: Claude <noreply@anthropic.com>
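
To illustrate the fixed-width encoding described in the commit above, here is a minimal sketch of a sign-flipped big-endian `i64` encoder and a sign-aware-mask `f64` encoder. It is not the crate's code: the function names, the sentinel byte values, and the choice to zero-fill null rows are assumptions made for the example.

```rust
// Hypothetical sentinel values; the commit message does not state the actual bytes.
const NULL_SENTINEL: u8 = 0x00;
const VALID_SENTINEL: u8 = 0x01;

/// Encode an optional i64 as 1 sentinel byte + 8 row-comparable bytes.
/// Flipping the sign bit maps the signed range onto the unsigned range
/// while preserving order; big-endian makes byte-wise comparison agree.
pub fn encode_i64(value: Option<i64>) -> [u8; 9] {
    let mut out = [0u8; 9];
    match value {
        // Null rows keep zero value bytes (an assumption for this sketch).
        None => out[0] = NULL_SENTINEL,
        Some(v) => {
            out[0] = VALID_SENTINEL;
            let flipped = (v as u64) ^ (1u64 << 63);
            out[1..].copy_from_slice(&flipped.to_be_bytes());
        }
    }
    out
}

/// Encode an optional f64 with a sign-aware mask: negative values have
/// every bit flipped, non-negative values have only the sign bit flipped.
pub fn encode_f64(value: Option<f64>) -> [u8; 9] {
    let mut out = [0u8; 9];
    match value {
        None => out[0] = NULL_SENTINEL,
        Some(v) => {
            out[0] = VALID_SENTINEL;
            let bits = v.to_bits();
            let mask = if bits >> 63 == 1 { u64::MAX } else { 1u64 << 63 };
            out[1..].copy_from_slice(&(bits ^ mask).to_be_bytes());
        }
    }
    out
}
```

With these definitions, sorting the encoded byte arrays lexicographically yields the same order as sorting the original values, and the null sentinel chosen here places nulls first.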
Extend the codec to handle Utf8/Binary via VarBinView arrays. Each value
encodes as a 1-byte sentinel followed by 32-byte chunks: every full
chunk has a 0xFF continuation marker; the final partial chunk pads with
zeros and writes the partial length (1..=32) as its trailing byte.

`encode_varlen_value` uses the simple byte-at-a-time XOR loop here; a
faster `copy_nonoverlapping` + stamped continuation version replaces it
in PR 2. `encode_varbinview` uses `arr.with_iterator(...)` for both the
nullable and non-nullable branches; a direct view walk for the no-nulls
branch lands in PR 2 too.

`row_width_for_dtype` now returns `Variable` for Utf8/Binary; the size
pass and encode dispatchers route through `add_size_varbinview` /
`encode_varbinview` correspondingly. The scalar encoder gains
`encode_scalar_varlen` and the matching Utf8/Binary arms.

Signed-off-by: Claude <noreply@anthropic.com>
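
The 32-byte chunk layout from the commit above can be sketched as follows. This is an illustration only: the function name, the three sentinel values, and the handling of empty values (sentinel only, no chunks) are assumptions, and the descending-order XOR step is left out.

```rust
const BLOCK: usize = 32;
const CONTINUATION: u8 = 0xFF;
// Hypothetical sentinel values; the commit message does not state the actual bytes.
const NULL_SENTINEL: u8 = 0x00;
const EMPTY_SENTINEL: u8 = 0x01;
const NON_EMPTY_SENTINEL: u8 = 0x02;

/// Append the row encoding of one variable-length value to `out`.
/// Non-empty values become: sentinel, then one 33-byte block per 32-byte
/// chunk (32 data bytes + 1 marker byte).
pub fn encode_varlen(value: Option<&[u8]>, out: &mut Vec<u8>) {
    let bytes = match value {
        None => {
            out.push(NULL_SENTINEL);
            return;
        }
        Some(b) if b.is_empty() => {
            out.push(EMPTY_SENTINEL);
            return;
        }
        Some(b) => b,
    };
    out.push(NON_EMPTY_SENTINEL);
    let mut chunks = bytes.chunks(BLOCK).peekable();
    while let Some(chunk) = chunks.next() {
        out.extend_from_slice(chunk);
        if chunks.peek().is_some() {
            // More data follows: a full chunk ends with the continuation marker.
            out.push(CONTINUATION);
        } else {
            // Final chunk: zero-pad to 32 bytes, then store its length (1..=32).
            let padded_len = out.len() + (BLOCK - chunk.len());
            out.resize(padded_len, 0);
            out.push(chunk.len() as u8);
        }
    }
}
```

Because `0xFF` is larger than any final-chunk length, a value that continues past a shared 32-byte prefix compares greater than one that ends there.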
Extend the codec to handle Struct, FixedSizeList, and Extension
canonical variants. Each nested row encodes as `outer_sentinel | child
bytes...`; for null rows the child bytes are zero-filled after the
recursive encoders run so two null rows compare equal regardless of
which non-null values would have been written by the children.

`row_width_for_dtype` recurses through Struct fields and FSL elements
to return `Fixed(w)` when every leaf is fixed; otherwise `Variable`.
Extension delegates to its storage dtype. List remains `Variable` and
ListView still bails (the row encoder's output is itself a ListView, so
nested ListView isn't a near-term use case). Variant and Union bail
explicitly.

Signed-off-by: Claude <noreply@anthropic.com>
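
The recursive width classification from the commit above can be sketched with a toy dtype. `ToyDType`, `RowWidth`, and `row_width` are hypothetical stand-ins rather than the crate's types, and the byte widths (one sentinel byte per leaf, one outer sentinel byte per nested value) are assumptions inferred from the `outer_sentinel | child bytes...` layout.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RowWidth {
    Fixed(u32),
    Variable,
}

/// Toy stand-in for a dtype tree; not the real Vortex `DType`.
pub enum ToyDType {
    Bool,
    Primitive { byte_width: u32 },
    Utf8,
    Struct(Vec<ToyDType>),
    FixedSizeList { element: Box<ToyDType>, len: u32 },
}

/// Classify a dtype as fixed-width (with its per-row byte count) or variable.
pub fn row_width(dtype: &ToyDType) -> RowWidth {
    match dtype {
        // Assumed layout: 1 sentinel byte + 1 value byte.
        ToyDType::Bool => RowWidth::Fixed(2),
        // Assumed layout: 1 sentinel byte + the value's bytes.
        ToyDType::Primitive { byte_width } => RowWidth::Fixed(1 + *byte_width),
        ToyDType::Utf8 => RowWidth::Variable,
        ToyDType::Struct(fields) => {
            // Outer sentinel, then every child in order.
            let mut total = 1u32;
            for field in fields {
                match row_width(field) {
                    RowWidth::Fixed(w) => total += w,
                    // One variable-width leaf makes the whole struct variable.
                    RowWidth::Variable => return RowWidth::Variable,
                }
            }
            RowWidth::Fixed(total)
        }
        ToyDType::FixedSizeList { element, len } => match row_width(element) {
            RowWidth::Fixed(w) => RowWidth::Fixed(1 + w * *len),
            RowWidth::Variable => RowWidth::Variable,
        },
    }
}
```

Under these assumed widths, a struct of an `i64` and a `bool` classifies as `Fixed(12)`, while any struct containing a `Utf8` field classifies as `Variable`.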
Add the size-pass machinery used by both RowSize and the upcoming
RowEncode pipeline. `compute_sizes` walks the N input columns once,
classifying each via `row_width_for_dtype` and accumulating
fixed-width-prefix sums in `fixed_per_row` while pushing per-row sums
of variable-length columns into a lazily allocated `var_lengths` vec.

The classification result (`ColKind` + `SizePassResult`) is private to
the crate; RowEncode consumes it in a later commit to choose between
the arithmetic and cursor encode paths.

`RowSize` returns a `Struct { fixed: U32, var: U32 }` so callers can
read the per-row width without realizing the constant `fixed` slot as
a per-row buffer (it's a `ConstantArray`); the `var` slot is a
`ConstantArray(0)` when no varlen column is present.

`dispatch_size` is the fallback-only path for PR 1 (canonicalize, then
codec::field_size). The `RowSizeKernel` trait exists but is unused; per-
encoding fast paths and the inventory registry arrive in PR 3.

`initialize()` does NOT register RowSize yet - that lands once
RowEncode is in place, so the session-registered pair appears together.

Signed-off-by: Claude <noreply@anthropic.com>
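
A rough sketch of the size pass described in the commit above, under simplifying assumptions: columns arrive pre-classified as `ToyColumn` values, and `SizePass` stands in for the crate-private result type. Both names are hypothetical, and the real `compute_sizes` works on arrays rather than precomputed lengths.

```rust
/// Toy pre-classified column: either a constant per-row width or
/// one encoded length per row.
pub enum ToyColumn {
    Fixed(u32),
    Variable(Vec<u32>),
}

pub struct SizePass {
    /// Sum of the widths of all fixed-width columns.
    pub fixed_per_row: u32,
    /// Per-row sums of variable-length columns; `None` when there are none.
    pub var_lengths: Option<Vec<u32>>,
}

/// Walk the columns once, accumulating the fixed width and, only if a
/// variable-length column shows up, allocating the per-row length vector.
pub fn compute_sizes(columns: &[ToyColumn], rows: usize) -> SizePass {
    let mut fixed_per_row = 0u32;
    let mut var_lengths: Option<Vec<u32>> = None;
    for column in columns {
        match column {
            ToyColumn::Fixed(width) => fixed_per_row += *width,
            ToyColumn::Variable(lengths) => {
                assert_eq!(lengths.len(), rows, "one length per row expected");
                // Lazy allocation: the vec exists only once it is needed.
                let totals = var_lengths.get_or_insert_with(|| vec![0; rows]);
                for (total, len) in totals.iter_mut().zip(lengths) {
                    *total += *len;
                }
            }
        }
    }
    SizePass { fixed_per_row, var_lengths }
}
```

When every column is fixed-width, `var_lengths` stays `None`, which corresponds to the case where the `var` slot can be a constant zero.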
Add the RowEncode variadic scalar function: encode N input columns into
a single ListView<u8> in a five-phase pipeline.

  Phase 1: size pass via `compute_sizes`.
  Phase 2: allocate a zero-initialized output buffer sized to fit every
           row's encoded bytes; bail if the total exceeds u32::MAX.
  Phase 3: build per-row `listview_offsets`: i * fixed_per_row for the
           pure-fixed case, or i * fixed_per_row + exclusive cumsum of
           varlen lengths otherwise. Uses the simple `Vec::push` +
           `checked_add` loop.
  Phase 4: walk columns left-to-right and call `dispatch_encode` for
           every column (cursor path for all). Each call writes its
           per-row bytes at `offsets[i] + cursors[i]` and advances the
           cursor.
  Phase 5: build the ListView<u8> via the validating `try_new`
           constructor.

`dispatch_encode` is the canonicalize-then-`codec::field_encode`
fallback; in-crate kernel arms and the inventory registry land in PR 3.
The `RowEncodeKernel` trait is defined but unused. PR 2 will iterate
on this pipeline (skip zero-init, skip ListView validation, auto-
vectorize the offsets loop, etc.).

Signed-off-by: Claude <noreply@anthropic.com>
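
Phase 3 of the pipeline above can be sketched as a `Vec::push` loop with `checked_add`. The function name and signature are hypothetical; returning `None` on overflow stands in for the bail when the total exceeds `u32::MAX`.

```rust
/// Build per-row start offsets: row i starts at
/// i * fixed_per_row + (sum of variable lengths of rows 0..i).
/// Returns None if the running total overflows u32.
pub fn build_offsets(
    fixed_per_row: u32,
    var_lengths: Option<&[u32]>,
    rows: usize,
) -> Option<Vec<u32>> {
    let mut offsets = Vec::with_capacity(rows);
    let mut offset = 0u32;
    for i in 0..rows {
        // Exclusive cumulative sum: record the offset before adding this row.
        offsets.push(offset);
        let var = var_lengths.map_or(0, |lengths| lengths[i]);
        offset = offset.checked_add(fixed_per_row)?.checked_add(var)?;
    }
    Some(offsets)
}
```

In the pure-fixed case (`var_lengths` is `None`) the result reduces to `i * fixed_per_row`, matching the arithmetic path named in the commit message.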
Wire the RowSize/RowEncode scalar functions to the user-facing API:

- `convert_columns` accepts a slice of input arrays and per-column
  SortFields, constructs `RowEncodeOptions` + `VecExecutionArgs`, and
  returns the encoded `ListViewArray<u8>`.
- `compute_row_sizes` returns just the per-row sizes (the `Struct
  { fixed: u32, var: u32 }` output of `RowSize`).
- `initialize()` now registers `RowSize` and `RowEncode` on the given
  session so they are reachable via the expression layer.

Tests cover sort-order round-trips for bool, primitive (i64 asc/desc,
u32, f64), utf8, multi-column, nulls_first/last, struct sort-order, the
single-buffer invariant of the ListView output, and the structural
shape of `RowSize`. Tests that exercise per-encoding fast paths
(`constant_path_matches_canonical`, `dict_path_matches_canonical`) land
together with their respective kernels in PR 3.

The bench file uses divan + mimalloc and reports throughput in GB/s of
encoded output bytes for primitive_i64, utf8, and struct_mixed. Each
has an `arrow_row` baseline and a `vortex` measurement. Per-encoding
fast-path scenarios (constant/dict/patched/bitpacked/for/delta) gain
their triplets in PR 3.

Baseline measurements at this commit (sample-count=10):
  primitive_i64_vortex  ~1.97 GB/s  (vs arrow-row 4.12 GB/s)
  utf8_vortex           ~0.87 GB/s  (vs arrow-row 1.56 GB/s)
  struct_mixed_vortex   ~0.95 GB/s  (vs arrow-row 1.19 GB/s)

PR 2 closes most of the gap by replacing the validating
`ListViewArray::try_new` with `new_unchecked`, skipping the buffer
zero-init, auto-vectorizing the offsets and varlen-block paths, etc.

Signed-off-by: Claude <noreply@anthropic.com>
@codspeed-hq

codspeed-hq Bot commented May 22, 2026

Merging this PR will improve performance by 19.8%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 1 improved benchmark
✅ 1250 untouched benchmarks

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|------|-----------|------|------|------------|
| Simulation | `chunked_varbinview_opt_canonical_into[(1000, 10)]` | 225.2 µs | 188 µs | +19.8% |


Comparing ji/row-pr1-base (48f59ce) with develop (1241e14)


t
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>
t
Signed-off-by: Joe Isaacs <joe.isaacs@live.co.uk>