
perf: optimize vector marshal/unmarshal for float32/float64/int32/int64 (Throughput: 75 MiB/s → 1.9 GiB/s marshal, 106 MiB/s → 4.6 GiB/s unmarshal) #770

Draft
mykaul wants to merge 4 commits into scylladb:master from mykaul:vector-perf-optimize

Conversation


@mykaul mykaul commented Mar 13, 2026

Summary

Type-specialized fast paths for vector<float>, vector<double>, vector<int>, vector<bigint>, and vector<uuid>/vector<timeuuid> that bypass reflect-based per-element marshaling in favor of direct encoding/binary bulk conversion; sync.Pool buffer reuse wired into the connection write path; and a VectorType.NewWithError() fast path that eliminates the expensive goType() → asVectorType() re-parse on every call.

Commit 1: d527db1 perf: optimize vector marshal/unmarshal for float32/float64/int32/int64

Fast-path type switches, 8 dedicated marshal/unmarshal functions, sync.Pool infrastructure (getVectorBuf/putVectorBuf), unmarshal slice reuse, generic-path buf.Grow() preallocation via vectorFixedElemSize(), and comprehensive tests (58 subtests across 13 categories).

Commit 2: 04f0783 perf: wire putVectorBuf into connection write path

Adds defer putVectorBuf(...) calls in executeQuery() and executeBatch() in conn.go, so production callers return pooled marshal buffers after the framer copies them. This closes the pool lifecycle and achieves 48 B/op steady-state on the write path.

Commit 3: d48f44f perf: add UUID/TimeUUID vector fast path

Adds marshal/unmarshal fast paths for vector<uuid> and vector<timeuuid> — bulk copy() of fixed 16-byte elements with zero per-element allocations. UUID vectors are common in similarity search use cases (storing document IDs alongside embeddings).
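The bulk-copy idea behind the UUID fast path can be sketched in a few lines. This is a minimal sketch, assuming UUID is gocql's 16-byte value type; the function name is illustrative, not the PR's actual function:

```go
package main

import "fmt"

// UUID mirrors gocql's 16-byte UUID value type (assumption for this sketch).
type UUID [16]byte

// marshalVectorUUIDSketch encodes a UUID vector by bulk-copying each fixed
// 16-byte element into one preallocated buffer: no per-element allocations,
// no endian conversion needed. (Hypothetical name.)
func marshalVectorUUIDSketch(v []UUID) []byte {
	buf := make([]byte, len(v)*16)
	for i, u := range v {
		copy(buf[i*16:(i+1)*16], u[:])
	}
	return buf
}

func main() {
	v := []UUID{{0x01}, {0x02}}
	out := marshalVectorUUIDSketch(v)
	fmt.Println(len(out)) // 32: two 16-byte elements
}
```

In the PR itself this is combined with the pooled buffers, so steady-state marshal of a UUID vector performs no allocation at all.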

Commit 4: 66ed9aa perf: add VectorType.NewWithError() to avoid goType/asVectorType re-parse

VectorType embeds NativeType but had no NewWithError() method. Calls therefore dispatched to NativeType.NewWithError(), which hit the TypeCustom fallback (goType() → asVectorType()), re-parsing the full Java type string on every call. The new method returns a pointer to a concrete slice of the subtype's Go type directly: 10.7x faster (181ns → 17ns), 75% fewer allocs (4 → 1), 74% less memory (92B → 24B).
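The idea can be sketched as follows. All names here are illustrative stand-ins, not the PR's actual API (the real method is VectorType.NewWithError() and covers more subtypes): for known subtypes, hand back a pointer to a concrete slice type directly instead of re-parsing the type string and building the type via reflection.

```go
package main

import "fmt"

// newVectorDest returns a pointer to a fresh element slice for a known
// subtype without any reflection or type-string parsing (fast path).
// Hypothetical name and subtype keys for illustration only.
func newVectorDest(subtype string) (interface{}, error) {
	switch subtype {
	case "float": // vector<float> scans into *[]float32
		return &[]float32{}, nil
	case "double": // vector<double> scans into *[]float64
		return &[]float64{}, nil
	case "int":
		return &[]int32{}, nil
	case "bigint":
		return &[]int64{}, nil
	default:
		return nil, fmt.Errorf("no fast path for subtype %q", subtype)
	}
}

func main() {
	d, _ := newVectorDest("float")
	fmt.Printf("%T\n", d) // *[]float32
}
```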

Headline numbers (vector<float, 1536>, typical embedding dimension)

  • 22x faster marshal (fast paths alone), 41x with pool recycling (see Pooled benchmarks)
  • 36x faster unmarshal, zero allocations steady state
  • 99.93% fewer allocations on marshal (3,074 → 2)
  • Marshal memory: 18,456 B/op → 6,172 B/op (fast paths) → 48 B/op (with pool recycling)
  • Unmarshal memory: 6,168 B/op → 0 B/op
  • 10.7x faster NewWithError() for VectorType (181ns → 17ns)

Benchmark results

All benchmarks: 6 iterations, benchstat, all p=0.002. Machine: 12th Gen Intel Core i7-1270P.

Master = 3881f1e (origin/master), Optimized = 66ed9aa (this branch HEAD).

Latency (ns/op) — Master vs Optimized (all 4 commits)

| Benchmark | Master | Optimized | Speedup |
|---|---:|---:|---:|
| **float32** | | | |
| MarshalVectorFloat32/dim_128 | 4,641 | 285 | 16.3x |
| MarshalVectorFloat32/dim_384 | 16,237 | 699 | 23.2x |
| MarshalVectorFloat32/dim_768 | 26,785 | 1,234 | 21.7x |
| MarshalVectorFloat32/dim_1536 | 54,049 | 2,415 | 22.4x |
| UnmarshalVectorFloat32/dim_128 | 3,116 | 95 | 32.8x |
| UnmarshalVectorFloat32/dim_384 | 10,647 | 280 | 38.0x |
| UnmarshalVectorFloat32/dim_768 | 18,536 | 586 | 31.6x |
| UnmarshalVectorFloat32/dim_1536 | 39,186 | 1,092 | 35.9x |
| **float64** | | | |
| MarshalVectorFloat64/dim_128 | 4,700 | 380 | 12.4x |
| MarshalVectorFloat64/dim_384 | 13,125 | 985 | 13.3x |
| MarshalVectorFloat64/dim_768 | 29,809 | 1,905 | 15.6x |
| MarshalVectorFloat64/dim_1536 | 59,207 | 3,754 | 15.8x |
| **int32** | | | |
| MarshalVectorInt32/dim_128 | 4,599 | 249 | 18.5x |
| MarshalVectorInt32/dim_384 | 13,560 | 654 | 20.7x |
| MarshalVectorInt32/dim_768 | 25,402 | 1,166 | 21.8x |
| MarshalVectorInt32/dim_1536 | 47,432 | 2,262 | 21.0x |
| UnmarshalVectorInt32/dim_128 | 3,201 | 94 | 34.1x |
| UnmarshalVectorInt32/dim_384 | 9,763 | 279 | 35.0x |
| UnmarshalVectorInt32/dim_768 | 19,700 | 547 | 36.0x |
| UnmarshalVectorInt32/dim_1536 | 40,106 | 1,073 | 37.4x |
| **int64** | | | |
| MarshalVectorInt64/dim_128 | 4,643 | 368 | 12.6x |
| MarshalVectorInt64/dim_384 | 13,578 | 952 | 14.3x |
| MarshalVectorInt64/dim_768 | 26,887 | 1,834 | 14.7x |
| MarshalVectorInt64/dim_1536 | 62,726 | 3,636 | 17.3x |
| UnmarshalVectorInt64/dim_128 | 3,387 | 111 | 30.5x |
| UnmarshalVectorInt64/dim_384 | 9,854 | 343 | 28.7x |
| UnmarshalVectorInt64/dim_768 | 21,628 | 601 | 36.0x |
| UnmarshalVectorInt64/dim_1536 | 44,880 | 1,190 | 37.7x |
| **UUID** | | | |
| MarshalVectorUUID/dim_128 | 8,154 | 710 | 11.5x |
| MarshalVectorUUID/dim_384 | 25,949 | 1,994 | 13.0x |
| MarshalVectorUUID/dim_768 | 52,738 | 3,861 | 13.7x |
| MarshalVectorUUID/dim_1536 | 95,785 | 7,480 | 12.8x |
| UnmarshalVectorUUID/dim_128 | 3,688 | 119 | 31.0x |
| UnmarshalVectorUUID/dim_384 | 11,164 | 330 | 33.8x |
| UnmarshalVectorUUID/dim_768 | 22,201 | 651 | 34.1x |
| UnmarshalVectorUUID/dim_1536 | 44,435 | 1,294 | 34.3x |
| **NewWithError / RowData** | | | |
| VectorNewWithError/VectorType | 181 | 17 | 10.7x |
| VectorNewWithError/NativeType_fallback | 183 | 168 | 1.1x |
| RowDataWithVector | 503 | 108 | 4.7x |

Pool wiring benefit (Commit 2) — production write path

The table above measures Marshal()/Unmarshal() via the public API, which does not return buffers to the pool. In production, executeQuery()/executeBatch() return the buffer via putVectorBuf() after the framer copies it. The Pooled benchmarks simulate this:

| Benchmark | Master | Fast-path only | Speedup vs master | + Pool return | Speedup vs master |
|---|---:|---:|---:|---:|---:|
| MarshalFloat32Pooled/dim_128 | 4,641 | 285 | 16.3x | 161 | 28.8x |
| MarshalFloat32Pooled/dim_384 | 16,237 | 699 | 23.2x | 369 | 44.0x |
| MarshalFloat32Pooled/dim_768 | 26,785 | 1,234 | 21.7x | 679 | 39.4x |
| MarshalFloat32Pooled/dim_1536 | 54,049 | 2,415 | 22.4x | 1,306 | 41.4x |
| MarshalInt32Pooled/dim_1536 | 47,432 | 2,262 | 21.0x | 1,100 | 43.1x |
| MarshalInt64Pooled/dim_1536 | 62,726 | 3,636 | 17.3x | 1,299 | 48.3x |
| MarshalUUIDPooled/dim_1536 | 95,785 | 7,480 | 12.8x | 1,440 | 66.5x |

Memory with pool return is 48 B/op constant, regardless of vector dimension or element type (from sync.Pool interface boxing overhead, irreducible). Compare to master: 18,456 B/op for float32/dim_1536, 98,328 B/op for UUID/dim_1536.
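Where that constant overhead comes from can be demonstrated directly: putting a []byte into a sync.Pool boxes the slice header into an interface{}, allocating a small header object even though the backing array itself is reused. This is an illustrative measurement, not the PR's benchmark:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// measureBoxing returns the average allocations per Put/Get cycle on a
// sync.Pool holding a []byte. The backing array is reused, but each Put
// converts the slice to interface{}, which heap-allocates the slice header.
func measureBoxing() float64 {
	var pool sync.Pool
	buf := make([]byte, 6144) // vector<float,1536> wire size
	return testing.AllocsPerRun(1000, func() {
		pool.Put(buf)             // boxes the slice header: one small alloc
		buf = pool.Get().([]byte) // same backing array comes back
	})
}

func main() {
	fmt.Println("allocs per Put/Get cycle:", measureBoxing())
}
```

This boxing cost is what the PR describes as irreducible for a []byte-valued pool; it is independent of the buffer size, which is why the write path flattens to a small constant B/op regardless of vector dimension.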

Full benchstat details: memory and allocations

Memory (B/op)

| Benchmark | Master | Optimized | Change |
|---|---:|---:|---:|
| RowDataWithVector | 216 | 144 | -33.33% |
| UnmarshalVectorFloat32/dim_128 | 536 | 0 | -100.00% |
| UnmarshalVectorFloat32/dim_384 | 1,560 | 0 | -100.00% |
| UnmarshalVectorFloat32/dim_768 | 3,096 | 0 | -100.00% |
| UnmarshalVectorFloat32/dim_1536 | 6,168 | 0 | -100.00% |
| MarshalVectorFloat32/dim_128 | 1,560 | 536 | -65.64% |
| MarshalVectorFloat32/dim_384 | 4,632 | 1,561 | -66.30% |
| MarshalVectorFloat32/dim_768 | 9,240 | 3,098 | -66.47% |
| MarshalVectorFloat32/dim_1536 | 18,456 | 6,172 | -66.55% |
| MarshalVectorFloat64/dim_128 | 3,096 | 1,048 | -66.15% |
| MarshalVectorFloat64/dim_384 | 9,240 | 3,098 | -66.47% |
| MarshalVectorFloat64/dim_768 | 18,456 | 6,173 | -66.55% |
| MarshalVectorFloat64/dim_1536 | 36,888 | 12,319 | -66.60% |
| MarshalVectorInt32/dim_128 | 1,560 | 536 | -65.64% |
| MarshalVectorInt32/dim_384 | 4,632 | 1,561 | -66.30% |
| MarshalVectorInt32/dim_768 | 9,240 | 3,098 | -66.47% |
| MarshalVectorInt32/dim_1536 | 18,456 | 6,172 | -66.55% |
| MarshalVectorInt64/dim_128 | 3,096 | 1,048 | -66.15% |
| MarshalVectorInt64/dim_384 | 9,240 | 3,098 | -66.47% |
| MarshalVectorInt64/dim_768 | 18,456 | 6,173 | -66.55% |
| MarshalVectorInt64/dim_1536 | 36,888 | 12,319 | -66.60% |
| MarshalVectorUUID/dim_128 | 8,216 | 2,072 | -74.77% |
| MarshalVectorUUID/dim_384 | 24,600 | 6,172 | -74.91% |
| MarshalVectorUUID/dim_768 | 49,176 | 12,312 | -74.94% |
| MarshalVectorUUID/dim_1536 | 98,328 | 24,600 | -74.96% |
| UnmarshalVectorInt32 (all dims) | 536–6,168 | 0 | -100.00% |
| UnmarshalVectorInt64 (all dims) | 1,048–12,312 | 0 | -100.00% |
| UnmarshalVectorUUID (all dims) | 2,072–24,600 | 0 | -100.00% |
| VectorNewWithError/VectorType | 92 | 24 | -73.91% |

Allocations (allocs/op)

| Benchmark | Master | Optimized | Change |
|---|---:|---:|---:|
| RowDataWithVector | 8 | 5 | -37.50% |
| MarshalVectorFloat32/dim_128 | 258 | 2 | -99.22% |
| MarshalVectorFloat32/dim_384 | 770 | 2 | -99.74% |
| MarshalVectorFloat32/dim_768 | 1,538 | 2 | -99.87% |
| MarshalVectorFloat32/dim_1536 | 3,074 | 2 | -99.93% |
| MarshalVectorFloat64 (all dims) | 258–3,074 | 2 | -99.22% to -99.93% |
| MarshalVectorInt32 (all dims) | 258–3,074 | 2 | -99.22% to -99.93% |
| MarshalVectorInt64 (all dims) | 258–3,074 | 2 | -99.22% to -99.93% |
| MarshalVectorUUID/dim_128 | 386 | 2 | -99.48% |
| MarshalVectorUUID/dim_384 | 1,154 | 2 | -99.83% |
| MarshalVectorUUID/dim_768 | 2,306 | 2 | -99.91% |
| MarshalVectorUUID/dim_1536 | 4,610 | 2 | -99.96% |
| All unmarshal (all types, all dims) | 2 | 0 | -100.00% |
| VectorNewWithError/VectorType | 4 | 1 | -75.00% |

Pooled marshal memory (B/op) — with pool return

| Benchmark | B/op |
|---|---:|
| MarshalFloat32Pooled (all dims) | 48 |
| MarshalInt32Pooled (all dims) | 48 |
| MarshalInt64Pooled (all dims) | 48 |
| MarshalUUIDPooled (all dims) | 48 |

What changed

marshal.go

  1. Fast-path type switches in marshalVector() and unmarshalVector() — before the existing reflect-based generic path, a switch on info.SubType.Type() intercepts []float32, []float64, []int32, []int64, and []UUID and dispatches to 10 dedicated functions. All other types fall through to the generic path.

  2. 10 new marshal/unmarshal functions — marshalVectorFloat32/Float64/Int32/Int64/UUID and the corresponding unmarshalVector* functions. Float/int variants use encoding/binary.BigEndian.PutUint32/PutUint64 with math.Float32bits/Float64bits. UUID variants use bulk copy() of 16-byte elements.

  3. sync.Pool buffer reuse — vectorBufPool, getVectorBuf(size), and putVectorBuf(buf) with a 64 KiB cap guard.

  4. Unmarshal slice reuse — all unmarshal fast paths reuse the destination slice's backing array when capacity is sufficient, achieving zero allocations on repeated reads.

  5. Generic path preallocation — vectorFixedElemSize() returns the wire-format byte size for fixed-length CQL types; the generic path calls buf.Grow() upfront.

  6. VectorType.NewWithError() — returns a pointer to a concrete slice of the subtype's Go type directly, without going through goType() → asVectorType(). 10.7x faster, 75% fewer allocs.
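The pooling helpers (item 3) can be sketched as follows. This is a minimal sketch following the description above (getVectorBuf/putVectorBuf with a 64 KiB cap guard); the PR's actual implementation may differ in detail:

```go
package main

import (
	"fmt"
	"sync"
)

var vectorBufPool sync.Pool

const maxPooledVectorBuf = 64 * 1024 // cap guard: oversized buffers are not pooled

// getVectorBuf returns a buffer of exactly size bytes, reusing a pooled
// backing array when its capacity suffices. Note: size 0 yields a non-nil
// empty slice, so empty vectors are not confused with CQL NULL.
func getVectorBuf(size int) []byte {
	if v := vectorBufPool.Get(); v != nil {
		if b := v.([]byte); cap(b) >= size {
			return b[:size]
		}
	}
	return make([]byte, size)
}

// putVectorBuf returns a buffer to the pool after the caller is done with it
// (i.e., after the framer has copied the bytes). Zero-capacity and oversized
// buffers are dropped to keep the pool from bloating.
func putVectorBuf(buf []byte) {
	if cap(buf) == 0 || cap(buf) > maxPooledVectorBuf {
		return
	}
	vectorBufPool.Put(buf[:0])
}

func main() {
	b := getVectorBuf(6144) // vector<float,1536> wire size
	fmt.Println(len(b))
	putVectorBuf(b)
}
```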

conn.go

  1. Pool return wiring — executeQuery() and executeBatch() call defer putVectorBuf(...) on each queryValues.value, closing the pool lifecycle.
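The wiring has roughly this shape. Heavily simplified sketch: sendQuery, isPooled, and put are illustrative stand-ins for the PR's actual per-column logic in executeQuery/executeBatch:

```go
package main

import "fmt"

// sendQuery marshals and writes query values, then returns pooled vector
// buffers via a deferred loop. Returning happens only after the framer has
// copied the bytes, so the buffers are safe to reuse.
func sendQuery(values [][]byte, isPooled []bool, put func([]byte)) {
	defer func() {
		for i, v := range values {
			if isPooled[i] {
				put(v) // back to the pool for the next marshal
			}
		}
	}()
	// ... framer writes/copies the values here ...
	fmt.Println("sent", len(values), "values")
}

func main() {
	returned := 0
	sendQuery([][]byte{{1}, {2}}, []bool{true, false}, func([]byte) { returned++ })
	fmt.Println("returned to pool:", returned)
}
```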

Test files

  • marshal_vector_test.go — 58 unit subtests across 13 categories, including UUID-specific tests
  • vector_bench_test.go — benchmarks for all 5 types (marshal, pooled marshal, unmarshal) across dimensions 128/384/768/1536
  • marshal_test.go — TestVectorNewWithErrorConsistentWithGoType, TestVectorNewWithErrorReturnsSlicePointer
  • helpers_bench_test.go — BenchmarkVectorNewWithError, BenchmarkRowDataWithVector

How the bottleneck was eliminated

The original generic path for vector<float, 1536>:

  1. Called Marshal() 1,536 times through reflect dispatch
  2. Each call allocated a 4-byte []byte via encFloat32 (1,536 allocs)
  3. Appended each to a bytes.Buffer that grew incrementally (additional allocs)
  4. The buffer was then copied into framer.buf by writeBytes()

The fast path:

  1. Single getVectorBuf(6144) (pooled, zero-alloc steady state)
  2. Tight loop: binary.BigEndian.PutUint32(buf[i*4:], math.Float32bits(v))
  3. No reflect, no per-element dispatch, no intermediate allocations
  4. putVectorBuf() returns the buffer to the pool after c.exec()
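The fast-path loop can be sketched end to end (pooling elided; function names here are illustrative, not the PR's exact functions):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeFloat32Vector writes each element as a big-endian 4-byte word into a
// single preallocated buffer: no reflect, no per-element allocation.
func encodeFloat32Vector(v []float32) []byte {
	buf := make([]byte, len(v)*4)
	for i, f := range v {
		binary.BigEndian.PutUint32(buf[i*4:], math.Float32bits(f))
	}
	return buf
}

// decodeFloat32Vector reuses dst's backing array when capacity suffices,
// giving zero allocations on repeated reads (the slice-reuse optimization).
func decodeFloat32Vector(data []byte, dst []float32) []float32 {
	n := len(data) / 4
	if cap(dst) < n {
		dst = make([]float32, n)
	}
	dst = dst[:n]
	for i := range dst {
		dst[i] = math.Float32frombits(binary.BigEndian.Uint32(data[i*4:]))
	}
	return dst
}

func main() {
	in := []float32{1.5, -2.25, 0}
	fmt.Println(decodeFloat32Vector(encodeFloat32Vector(in), nil))
}
```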

Design decisions

  • Phase 3 (write-through to framer) was deliberately skipped — encoding directly into framer.buf would save ~0.4 µs (~21% marginal over the pooled path) but requires invasive changes to the queryValues struct used by all query paths. The risk/reward ratio is unfavorable.
  • Phase 4 (fix isVectorVariableLengthType) was deliberately skipped — there is a discrepancy in how some types are handled between Cassandra and ScyllaDB implementations. We focus only on the 5 types where the wire format is unambiguous.

Relationship to open PRs

Replaces PR #744 (float fast paths) and PR #745 (generic prealloc)

This PR is a strict superset of both:

| Feature | PR #744 | PR #745 | This PR |
|---|:---:|:---:|:---:|
| Float32/Float64 fast marshal | yes | | yes |
| Float32/Float64 fast unmarshal + slice reuse | yes | | yes |
| Int32/Int64 fast marshal/unmarshal | | | yes |
| UUID/TimeUUID fast marshal/unmarshal | | | yes |
| sync.Pool buffer reuse | | | yes |
| Pool wiring in conn.go | | | yes |
| VectorType.NewWithError() | | | yes |
| vectorFixedElemSize() helper | | yes | yes |
| Generic buf.Grow() prealloc | | yes | yes |

If this PR merges first, #744 and #745 become no-ops and should be closed.

Orthogonal to PRs #751, #752, #753

These PRs reduce per-request allocations in the connection/framing layer. This PR reduces allocations in the marshal/unmarshal layer. They are fully complementary — different files, different allocation sites, additive benefits.

Note: PR #749 (pool write-side framers) was closed — fully superseded by 3e1e7e4 on master.

Depends on PR #838

PR #838 fixes a pre-existing build failure in session_unit_test.go where hostId changed from string to UUID but test literals were not updated. This branch carries the same fix; once #838 merges, the fix becomes a no-op on rebase.

@mykaul mykaul marked this pull request as draft March 13, 2026 11:52
@mykaul mykaul requested a review from Copilot March 13, 2026 11:52

Copilot AI left a comment


Pull request overview

This PR introduces high-performance, type-specialized marshal/unmarshal paths for common numeric vector element types to reduce reflect overhead and allocations in the gocql CQL codec layer.

Changes:

  • Added fast paths in marshalVector / unmarshalVector for []float32, []float64, []int32, []int64 plus a sync.Pool-backed buffer helper for marshal-side reuse.
  • Added a generic-path preallocation helper (vectorFixedElemSize + buf.Grow) to reduce bytes.Buffer growth for fixed-size element types.
  • Added extensive unit tests for vector behavior and expanded internal/public benchmarks for new int32/int64 vector cases and pooled scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, vector buffer pool helpers, and generic vector preallocation support. |
| marshal_vector_test.go | New comprehensive unit tests for vector fast paths, pooling helpers, and preallocation behavior. |
| vector_bench_test.go | Adds/extends internal benchmarks for pooled marshal and int32/int64 vector performance. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vector marshal/unmarshal via gocql.Marshal/Unmarshal. |


Copilot AI left a comment


Pull request overview

Introduces type-specialized fast paths for vector marshal/unmarshal (float32/float64/int32/int64) in marshal.go to avoid reflect-heavy per-element encoding, plus expanded benchmarks and a new, comprehensive unit-test suite to validate correctness and performance characteristics.

Changes:

  • Add fast-path vector marshal/unmarshal implementations using encoding/binary bulk conversion and destination-slice reuse.
  • Add sync.Pool-backed buffer helpers (getVectorBuf/putVectorBuf) and generic-path preallocation via vectorFixedElemSize.
  • Expand internal and public benchmarks; add a large new unit test file covering many vector edge cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| marshal.go | Adds specialized vector marshal/unmarshal fast paths, pooling helpers, and generic-path preallocation. |
| marshal_vector_test.go | New unit tests for vector behavior (round-trip, byte-compat, slice reuse, pool behavior, etc.). |
| vector_bench_test.go | Adds pooled/write-path/round-trip benchmarks plus int32/int64 benchmark coverage. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |


Copilot AI left a comment


Pull request overview

This PR accelerates CQL vector encoding/decoding in the GoCQL driver by introducing type-specialized marshal/unmarshal fast paths for common numeric vector element types, reducing reflection overhead and allocations on hot paths.

Changes:

  • Add specialized marshal/unmarshal implementations for []float32, []float64, []int32, []int64 using encoding/binary + bit conversions, with unmarshal slice reuse.
  • Introduce vectorBufPool (sync.Pool) helpers for reusable marshal buffers and add generic-path preallocation via vectorFixedElemSize() + bytes.Buffer.Grow().
  • Add extensive unit tests for vector behavior and expand internal + public benchmarks for the new fast paths.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, buffer pooling helpers, and generic-path preallocation helper. |
| marshal_vector_test.go | New unit test suite covering round-trip, compatibility, reuse, pool behavior, and prealloc. |
| vector_bench_test.go | Adds pooled/unpooled benchmarks and a simulated write-path benchmark for vectors. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |


Copilot AI left a comment


Pull request overview

This PR adds specialized, non-reflect fast paths for marshaling/unmarshaling common vector<> element types in the GoCQL driver to significantly reduce allocations and improve throughput, along with extensive tests and expanded benchmarks.

Changes:

  • Add type-specialized vector marshal/unmarshal implementations for []float32, []float64, []int32, and []int64, plus a pooled []byte buffer facility for marshal fast paths.
  • Improve generic vector marshal performance via preallocation (buf.Grow) when element wire size is known.
  • Add a comprehensive unit test suite for vector behavior and extend internal + public benchmarks for the new int32/int64 paths and pooled scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, buffer pooling helpers, overflow guards, and generic-path preallocation. |
| marshal_vector_test.go | New comprehensive unit tests covering fast paths, edge cases, and pool behavior. |
| vector_bench_test.go | Adds pooled/write-path/round-trip benchmarks and int32/int64 benchmarks. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |


Copilot AI left a comment


Pull request overview

This PR introduces type-specialized fast paths for marshaling/unmarshaling common fixed-width vector element types (float32/float64/int32/int64) to avoid reflect-based per-element work, reducing allocations and significantly improving throughput in the driver’s vector serialization layer.

Changes:

  • Added fast-path dispatch in marshalVector/unmarshalVector with dedicated bulk encode/decode implementations for float32/float64/int32/int64 vectors.
  • Introduced a sync.Pool-backed byte buffer reuse mechanism for vector marshaling and slice-backing reuse for unmarshaling.
  • Added extensive unit tests plus expanded internal/public benchmarks for the new fast paths and pooled usage patterns.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| marshal.go | Adds fast-path vector marshal/unmarshal implementations, buffer pooling helpers, and generic-path preallocation. |
| marshal_vector_test.go | Adds a comprehensive unit test suite for vector behavior, pooling, edge cases, and compatibility. |
| vector_bench_test.go | Adds/extends internal benchmarks for pooled marshal, write-path simulation, and int vector types. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vector marshal/unmarshal. |

@mykaul mykaul force-pushed the vector-perf-optimize branch from 9e5efc1 to 234f670 Compare March 17, 2026 18:04

mykaul commented Mar 17, 2026

Addressed review feedback:

Fixed (this push):

  1. getVectorBuf(0) now returns non-nil empty slice instead of nil. Previously, marshaling a non-nil empty vector ([]float32{}) with dim==0 would return nil, which framer.writeBytes encodes as CQL NULL. Now it correctly returns make([]byte, 0), distinguishing empty vectors from NULL.
  2. Added dim==0 array validation: When Dimensions==0 and the destination is *[N]T where N!=0, we now return an error ("array of size N cannot store vector of 0 dimensions") instead of silently succeeding and leaving the array unchanged.
  3. Strengthened empty-vector tests: All 4 TestMarshalVector_EmptyVector subtests now assert data != nil in addition to len(data) == 0, catching the nil-vs-empty distinction.
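The nil-vs-empty distinction these fixes rely on can be shown in isolation. A minimal sketch: per the description above, framer.writeBytes encodes a nil value as CQL NULL (length -1) while a non-nil zero-length slice encodes as a present, empty value (length 0):

```go
package main

import "fmt"

// encodeLength models the wire-level distinction: nil -> NULL marker (-1),
// non-nil empty slice -> zero-length value. (Illustrative helper, not the
// driver's actual writeBytes.)
func encodeLength(b []byte) int32 {
	if b == nil {
		return -1 // CQL NULL
	}
	return int32(len(b)) // present value, possibly empty
}

func main() {
	fmt.Println(encodeLength(nil))             // -1
	fmt.Println(encodeLength(make([]byte, 0))) // 0
}
```

This is exactly why getVectorBuf(0) must return make([]byte, 0) rather than nil: both have length zero, but only one survives as "empty vector" on the wire.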

Already fixed in prior revisions:

  • Magic number 0x0015 → uint16(TypeDuration) (already done)
  • Displaced isVectorVariableLengthType doc comment (already adjacent to function)
  • -0 sign bit preservation tests (already have explicit Float32bits/Float64bits checks)
  • dim==0 unmarshal fast-paths already return non-nil empty slices via make([]float32, 0) etc.

Not a bug (Copilot false positives):

  • &result[:1][0] on make([]float32, 0, dim+10): This does NOT panic. Go allows reslicing up to capacity, so result[:1] is valid when cap(result) >= 1.
  • dim * 4 overflow: Already handled by vectorByteSize() which uses int64 arithmetic and checks for overflow.
  • vectorByteSize returning fmt.Errorf: All callers already wrap the error with marshalErrorf/unmarshalErrorf, so the final error type is correct.
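The reslice-to-capacity point is easy to verify in a few lines:

```go
package main

import "fmt"

// firstElemPtr shows that result[:1] is legal when cap(result) >= 1 even
// though len(result) == 0: reslicing may extend up to capacity, and indexing
// a slice expression yields an addressable element.
func firstElemPtr() float32 {
	result := make([]float32, 0, 10)
	p := &result[:1][0] // no panic: within capacity
	*p = 3.5
	return result[:1][0]
}

func main() {
	fmt.Println(firstElemPtr()) // 3.5
}
```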

Not changing (style preference):

  • readColWithSpec high-arity signature: Internal function with 2 call sites in the same file. A struct would add indirection without clear benefit.
  • Missing fixed-size types in vectorFixedElemSize: The function covers the types that have vector fast-paths. Other types fall through to the generic path which handles them correctly.


Copilot AI left a comment


Pull request overview

Adds type-specialized marshal/unmarshal fast paths for common vector<...> element types to significantly reduce reflection overhead and allocations in the driver’s value encoding/decoding layer.

Changes:

  • Introduces specialized marshal/unmarshal implementations for []float32, []float64, []int32, []int64 plus a sync.Pool-backed buffer helper for marshal fast paths.
  • Adds generic-path preallocation for fixed-size vector element types and improves 0-dimension handling in unmarshal.
  • Expands benchmarks and adds a comprehensive new unit test suite for vector behavior/performance characteristics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, pooled buffer helpers, fixed-element-size prealloc, and 0-dimension unmarshal handling. |
| marshal_vector_test.go | New, extensive unit tests for vector marshal/unmarshal correctness, edge cases, and pooling behavior. |
| vector_bench_test.go | Extends internal benchmarks to cover pooled write-path simulations and int32/int64 vectors. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to include int32/int64 vector marshal/unmarshal. |

@mykaul mykaul force-pushed the vector-perf-optimize branch from b0a2d82 to 118e06c Compare March 24, 2026 18:21
@mykaul mykaul force-pushed the vector-perf-optimize branch 2 times, most recently from 432f624 to 9410b42 Compare April 4, 2026 11:54
dkropachev (Collaborator) commented:

@mykaul, it is a great idea to use pooled buffers, but I think we need to make it generic so that it works the same way for every data type; I don't see any point in targeting vectors specifically.

@mykaul mykaul changed the title perf: optimize vector marshal/unmarshal for float32/float64/int32/int64 perf: optimize vector marshal/unmarshal for float32/float64/int32/int64 (Throughput: 75 MiB/s → 1.9 GiB/s (marshal), 106 MiB/s → 4.6 GiB/s (unmarshal) ) Apr 7, 2026
Add type-specialized fast paths for vector<float>, vector<double>,
vector<int>, and vector<bigint> that bypass reflect-based per-element
marshaling in favor of direct encoding/binary bulk conversion.

Changes in marshal.go:
- Type switches in marshalVector()/unmarshalVector() dispatch to
  dedicated functions for []float32, []float64, []int32, []int64
  before falling through to the generic reflect path.
- 8 new functions: marshalVectorFloat32, marshalVectorFloat64,
  unmarshalVectorFloat32, unmarshalVectorFloat64, marshalVectorInt32,
  marshalVectorInt64, unmarshalVectorInt32, unmarshalVectorInt64.
- sync.Pool buffer reuse (vectorBufPool/getVectorBuf/putVectorBuf)
  for zero-alloc steady state when callers return buffers after
  the framer copies them. 64KiB cap prevents pool bloat.
- Unmarshal fast paths reuse destination slice backing array when
  capacity is sufficient (zero-alloc steady state on read path).
- Generic path preallocation via vectorFixedElemSize() + buf.Grow()
  for non-fast-path fixed-size types (e.g. UUID, timestamp).
- vectorByteSize() helper guards against integer overflow on 32-bit
  platforms with corrupt or adversarial schema metadata.
- All fast-path errors are wrapped as MarshalError/UnmarshalError
  for consistent error typing.
- dim=0 vectors correctly encode as non-nil empty values (not CQL NULL)
  in both fast paths and generic path.
- Negative dimensions are rejected with clear error messages.

Benchmark results for vector<float, 1536> (typical embedding dimension):

  Marshal (baseline -> optimized):
    86.4 us/op  ->  3.4 us/op  (25x faster)
    3081 allocs ->  2 allocs    (99.94% fewer)
    28632 B/op  ->  6172 B/op   (78% less memory)

  Marshal with pool return (steady state):
    86.4 us/op  ->  1.6 us/op  (54x faster)
    3081 allocs ->  2 allocs    (99.94% fewer)
    28632 B/op  ->  48 B/op     (99.8% less memory)

  Unmarshal (baseline -> optimized):
    60.2 us/op  ->  1.5 us/op  (41x faster)
    2 allocs    ->  0 allocs    (100% fewer)
    6168 B/op   ->  0 B/op      (100% less memory)

  Round-trip (baseline -> optimized, pooled):
    147.8 us/op ->  3.1 us/op  (48x faster)
    3083 allocs ->  2 allocs    (99.94% fewer)
    34800 B/op  ->  48 B/op     (99.9% less memory)

  Throughput: 80 MB/s -> 3.5 GB/s (geomean, +2900%)

New test files:
- marshal_vector_test.go: 58+ unit subtests across 13 categories
  (round-trip, byte-compat, slice-reuse, nil, dimension-mismatch,
  empty-vector, pointer-to-slice, special-values, pool-concurrency,
  oversized-not-pooled, fixed-elem-size, generic-prealloc).
- vector_bench_test.go: extended with int32/int64 and pooled benchmarks.
- tests/bench/bench_vector_public_test.go: public API benchmarks for
  int32/int64 marshal/unmarshal.

Subsumes PR scylladb#744 (float fast paths) and PR scylladb#745 (generic prealloc).
Extends with int32/int64 fast paths and buffer pooling not covered by
any existing PR.

mykaul commented Apr 10, 2026

@mykaul, it is a great idea to use pooled buffers, but I think we need to make it generic so that it works the same way for every data type; I don't see any point in targeting vectors specifically.

@dkropachev - I targeted vectors precisely because they are large. I can have a pool per type, or do you prefer one general pool for all types?

@mykaul mykaul force-pushed the vector-perf-optimize branch from 5e62e7c to ea4e0d7 Compare April 10, 2026 13:49
mykaul added 3 commits April 10, 2026 17:37
Return pooled vector buffers to vectorBufPool after the framer copies
marshalled bytes in executeQuery and executeBatch. This completes the
zero-alloc steady-state cycle for vector marshal operations.

In executeQuery, a defer after the marshal loop returns buffers for
columns identified as pooled vector types (float32, float64, int32,
int64). In executeBatch, vector buffers are collected across all batch
statements and returned via a single defer.

The vectorBufPoolSubtype helper centralizes the type check to keep
the two call sites consistent with the marshal fast paths.

Includes unit tests covering vectorBufPoolSubtype classification,
single-query and batch pool return simulation, and non-pooled type
safety.
Add dedicated marshal/unmarshal fast paths for UUID and TimeUUID vector
elements, following the same pattern as the existing float32/float64/
int32/int64 fast paths.

UUID is [16]byte with no endian conversion needed, so the fast path
uses a simple copy() loop. Uses pooled buffers via getVectorBuf for
zero-alloc steady state on the marshal path, and reuses the destination
slice backing array on the unmarshal path.

Benchmarks (vs generic reflection path):
- Marshal: ~90% faster (10x speedup), 99%+ fewer allocations
- Unmarshal: ~97% faster (30-35x speedup), zero allocations
- Marshal+pool: additional 4x over non-pooled marshal
perf: add VectorType.NewWithError() to avoid goType/asVectorType re-parse

VectorType embeds NativeType but had no NewWithError() method, so calls
fell through to NativeType.NewWithError() which hit the TypeCustom
fallback: goType() → asVectorType() → re-parse the full Java type string
(e.g. 'org.apache.cassandra.db.marshal.VectorType(FloatType, 1536)')
on every invocation. This is called per-column per-row by RowData() and
MapScan, making it a hot path for vector workloads.

Add VectorType.NewWithError() with fast paths for all common element
types (float32, float64, int32, int64, UUID, string, bool, etc.) that
return *[]T directly without reflection or string parsing. Fallback for
exotic subtypes still uses SubType.NewWithError() + reflect.SliceOf but
avoids the asVectorType() re-parse.

Also fix zero-dimension error messages in fast-path unmarshal functions
to be consistent with the generic path (check dim==0 before byte-size
validation), fix copyright header in marshal_vector_test.go, and fix
pre-existing session_unit_test.go build error from origin/master
(hostId string → UUID type mismatch).

Benchmark results (VectorType.NewWithError vs NativeType fallback):

  VectorType:          ~17 ns/op, 24 B/op, 1 allocs/op
  NativeType_fallback: ~170 ns/op, 92 B/op, 4 allocs/op

  → 10x faster, 75% fewer allocations
@mykaul mykaul force-pushed the vector-perf-optimize branch from 81dae0b to 66ed9aa Compare April 10, 2026 14:38

mykaul commented Apr 10, 2026

Analysis: Generalizing the Pooled-Buffer / Fast-Path Concept

Following up on @dkropachev's comment:

"it is great idea to use pooled buffers, but I think we need to make it generic so that it works for every data type the same way, and don't see any point in targeting vectors specifically."

Here is a detailed analysis of whether, how, and to what extent the two core optimizations in this PR can be generalized.


Two Separable Concepts

The PR contains two distinct optimizations that should be evaluated separately:

| Concept | What it does | Generalizable? |
|---|---|---|
| A. sync.Pool buffer reuse | Reuse marshal output []byte buffers instead of allocating fresh ones per Marshal() call | Yes — universally |
| B. Type-specialized fast paths | Bypass per-element reflect + Marshal() dispatch for known homogeneous fixed-size collections | Partially — applies to list<T>, set<T>, vector<T>, and partially map<K,V> |

Why Generalization is Justified

Every collection marshal function (marshalList at marshal.go:718, marshalMap at marshal.go:1072, marshalVector at marshal.go:866) shares the identical overhead pattern:

  1. Allocates a fresh bytes.Buffer{} per call (no pooling)
  2. Calls Marshal(subType, rv.Index(i).Interface()) per element — full type-switch dispatch + reflect.ValueOf + interface boxing + per-element []byte allocation
  3. Each element's intermediate []byte (e.g., 4 bytes for int, 8 bytes for bigint) is immediately dead after being copied into the buffer

For a list<int> of 1000 elements, this means: 1 bytes.Buffer struct allocation + 1000 reflection operations + 1000 interface boxing operations + 1000 × 4-byte []byte allocations. The exact same waste that vectors have.
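The bulk-conversion alternative for the element payload can be sketched as follows; marshalInt32SliceFast is a hypothetical name, and the CQL collection framing (element count and per-element length prefixes) is omitted for brevity:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// marshalInt32SliceFast serializes []int32 in one pass with encoding/binary:
// a single output allocation, no reflect.Value.Index(i).Interface(), and no
// per-element Marshal dispatch or intermediate 4-byte []byte allocations.
func marshalInt32SliceFast(vals []int32) []byte {
	out := make([]byte, 0, 4*len(vals)) // one allocation for the whole payload
	for _, v := range vals {
		out = binary.BigEndian.AppendUint32(out, uint32(v))
	}
	return out
}

func main() {
	fmt.Printf("% x\n", marshalInt32SliceFast([]int32{1, 256}))
	// 00 00 00 01 00 00 01 00
}
```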

marshalTuple (marshal.go:1297) and marshalUDT (marshal.go:1530) use var buf []byte + append instead of bytes.Buffer, but still have per-element Marshal() dispatch overhead.


Value Analysis

Concept A (Buffer Pool) — Impact by Type

| Type Category | Pool Value | Rationale |
|---|---|---|
| list/set | HIGH | bytes.Buffer alloc eliminated; very common type |
| map | HIGH | Same, plus 2× element overhead (key+value) |
| tuple / UDT | MODERATE | []byte growth via append; pooling helps pre-sizing |
| vector | HIGH | Same (this PR's current target) |
| Scalars (standalone) | LOW | 4-8 byte allocations too small for pool overhead to help individually |

Concept B (Fast Paths) — Projected Speedup

Based on this PR's measured vector results and architectural similarity:

| Workload | Current (est.) | With Pool + Fast Path | Improvement |
|---|---|---|---|
| vector<float, 1536> marshal | ~54 µs | ~1.3 µs | ~41× (proven) |
| list<int> marshal, 1000 elems | ~47 µs | ~2-3 µs | ~15-25× (projected) |
| list<float> unmarshal, 1000 elems | ~39 µs | ~1-2 µs | ~20-35× (projected) |
| map<text,int> marshal, 100 entries | ~15 µs | ~5-8 µs | ~2-3× (text keys limit gains) |
| Scalar int marshal (standalone) | ~50 ns | ~50 ns | No change |

Risk Assessment

| Risk | Severity | Probability | Mitigation |
|---|---|---|---|
| Data aliasing — pooled buffer reused while still referenced | Critical | Low | Framer copies via append(f.buf, p...) before return. Verify no path holds a reference past putBuf. |
| Pool leak — buffers not returned, growing GC pressure | High | Medium | defer pattern; cover all exit paths in conn.go |
| Correctness regression in fast paths | High | Low | PR #770's 58-subtest rigor (incl. -0, NaN) must extend to list/map fast paths. Byte-identical output vs. the reflect path is the critical invariant. |
| Wire format divergence (Cassandra vs ScyllaDB) | Medium | Low | For list/set/map, the wire format is well-standardized. Lower risk than vector. |
| Increased code complexity | Medium | Certain | Go generics (available: project targets Go 1.25) can reduce the O(types²) explosion — the codebase already uses generics in internal/lru and internal/eventbus, and has 42 TODO comments across serialization/ packages noting "when generic-based serialization is introduced" |

Complexity & Implementation Plan

Recommended: two-phase approach.

Phase 1: Generalized Buffer Pool (Low Risk, High Value) — ~100-150 LOC

  1. Replace vectorBufPool with a general-purpose marshalBufPool (sync.Pool of *bytes.Buffer or size-bucketed []byte pools matching the existing queryValuesPools pattern in frame.go:1227)
  2. Modify marshalList, marshalMap, marshalVector to use getMarshalBuf() instead of &bytes.Buffer{}
  3. Add buf.Grow() pre-sizing to marshalList and marshalMap (vector already has this via vectorFixedElemSize)
  4. Generalize putVectorBuf wiring in conn.go to return all marshaled queryValues.value buffers to the pool
  5. Rename vectorFixedElemSize → fixedElemSize (it's not vector-specific) and add missing types (TypeCounter, TypeSmallInt, TypeTinyInt, TypeBoolean)
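A minimal sketch of what the generalized pool could look like; getMarshalBuf, putMarshalBuf, and the 1 MiB cap guard are assumptions for illustration, not the driver's existing API:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// marshalBufPool is a hypothetical general-purpose replacement for the
// vector-specific pool: all collection marshal functions borrow a buffer
// here instead of allocating a fresh bytes.Buffer per call.
var marshalBufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func getMarshalBuf() *bytes.Buffer {
	return marshalBufPool.Get().(*bytes.Buffer)
}

// putMarshalBuf resets and returns a buffer to the pool; oversized buffers
// are dropped so one huge value cannot pin memory indefinitely.
func putMarshalBuf(buf *bytes.Buffer) {
	const maxPooledCap = 1 << 20 // assumed 1 MiB cap guard
	if buf.Cap() > maxPooledCap {
		return
	}
	buf.Reset()
	marshalBufPool.Put(buf)
}

func main() {
	b := getMarshalBuf()
	b.Grow(64) // pre-size, mirroring step 3
	b.WriteString("payload")
	fmt.Println(b.Len()) // 7
	putMarshalBuf(b)
}
```

The defer pattern from the risk table applies at call sites: `buf := getMarshalBuf(); defer putMarshalBuf(buf)` covers every exit path.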

Phase 2: Generalized Fast Paths with Go Generics (Medium Risk, High Value) — ~300-400 LOC

  1. Define generic marshalCollectionFixed[T float32|float64|int32|int64](...) that bulk-serializes []T using encoding/binary without per-element reflection
  2. Add type-switch fast paths in marshalList/marshalSet before the reflect fallback
  3. For marshalMap, add fast paths for common key-value pairs (map[string]int32, etc.)
  4. Port vector fast paths from this PR to use the same generic infrastructure
  5. Keep VectorType.NewWithError() as-is (vector-specific, no generalization needed)
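The type-switch dispatch from step 2 could look roughly like this; tryMarshalListFast is a hypothetical name, and the CQL framing bytes (count and per-element length prefixes) are again omitted:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// tryMarshalListFast is a sketch of the fast-path dispatch: it reports
// whether a bulk fast path applied, and callers fall back to the existing
// reflect-based marshalList when it returns false.
func tryMarshalListFast(v any) ([]byte, bool) {
	switch s := v.(type) {
	case []int32:
		out := make([]byte, 0, 4*len(s))
		for _, e := range s {
			out = binary.BigEndian.AppendUint32(out, uint32(e))
		}
		return out, true
	case []float32:
		out := make([]byte, 0, 4*len(s))
		for _, e := range s {
			out = binary.BigEndian.AppendUint32(out, math.Float32bits(e))
		}
		return out, true
	}
	// Exotic or named types (e.g. type MyFloat float32) take the reflect path.
	return nil, false
}

func main() {
	if b, ok := tryMarshalListFast([]int32{7}); ok {
		fmt.Printf("% x\n", b) // 00 00 00 07
	}
}
```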

What should NOT be generalized:

  • VectorType.NewWithError() — only vectors have the goType()→asVectorType() re-parse overhead
  • Write-through to framer (this PR's "Phase 3" design note) — too invasive, ~20% marginal gain
  • Standalone scalar marshal pooling — 4-8 byte allocs are too small; the fast paths eliminate them when they're inside collection loops anyway

Test Coverage Requirements

| Category | Priority |
|---|---|
| Round-trip correctness: fast-path output → unmarshal → deep-equal | Critical |
| Byte compatibility: fast-path output byte-identical to reflect-path | Critical |
| Edge cases: nil, empty, single element, -0, NaN, MaxFloat, MinInt | Critical |
| Pool lifecycle: no aliasing, cap guard, concurrent safety (race detector) | High |
| Benchmarks: list<int>/list<float>/map<text,int> at sizes 10/100/1000/10000 | High |
| Fallback correctness: non-fast-path types still work identically | Medium |
| Custom named types (type MyFloat float32) fall through to reflect path | Medium |

Conclusion

@dkropachev is right that the buffer pooling concept should be generalized. The marshal lifecycle (Marshal() → store in queryValues.value → copy into framer.buf → buffer is dead) is identical for all types — there is nothing vector-specific about it.

The type-specialized fast paths are also generalizable to list/set/map, with the same 15-35× speedup potential for homogeneous fixed-size collections. Go generics (already used in internal/lru and internal/eventbus) can keep the code duplication manageable.

Suggested path forward:

  • Phase 1 (generalized pool) can be part of this PR or an immediate follow-up — it's low-risk and directly addresses the review feedback
  • Phase 2 (generalized fast paths) can be a separate PR building on Phase 1
  • The vector-specific optimizations in this PR remain valid as the highest-impact instance of the pattern (1536-element vectors dwarf typical 10-100 element lists)
