
perf: optimize vector marshal/unmarshal for float32/float64/int32/int64 (Throughput: 75 MiB/s → 1.9 GiB/s marshal, 106 MiB/s → 4.6 GiB/s unmarshal) #770

Draft
mykaul wants to merge 4 commits into scylladb:master from mykaul:vector-perf-optimize

Conversation


@mykaul mykaul commented Mar 13, 2026

Summary

Type-specialized fast paths for vector<float>, vector<double>, vector<int>, vector<bigint>, and vector<uuid>/vector<timeuuid> that bypass reflect-based per-element marshaling in favor of direct encoding/binary bulk conversion; sync.Pool buffer reuse wired into the connection write path; and a VectorType.NewWithError() fast path that eliminates the expensive goType() → asVectorType() re-parse on every call.

Commit 1: d527db1 perf: optimize vector marshal/unmarshal for float32/float64/int32/int64

Fast-path type switches, 8 dedicated marshal/unmarshal functions, sync.Pool infrastructure (getVectorBuf/putVectorBuf), unmarshal slice reuse, generic-path buf.Grow() preallocation via vectorFixedElemSize(), and comprehensive tests (58 subtests across 13 categories).

Commit 2: 04f0783 perf: wire putVectorBuf into connection write path

Adds defer putVectorBuf(...) calls in executeQuery() and executeBatch() in conn.go, so production callers return pooled marshal buffers after the framer copies them. This closes the pool lifecycle and achieves 48 B/op steady-state on the write path.

Commit 3: d48f44f perf: add UUID/TimeUUID vector fast path

Adds marshal/unmarshal fast paths for vector<uuid> and vector<timeuuid> — bulk copy() of fixed 16-byte elements with zero per-element allocations. UUID vectors are common in similarity search use cases (storing document IDs alongside embeddings).
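The bulk-copy idea behind the UUID fast path can be sketched in a few lines. This is a minimal sketch, assuming UUID is gocql's 16-byte value type; the function name is illustrative, not the PR's actual function:

```go
package main

import "fmt"

// UUID mirrors gocql's 16-byte UUID value type (assumption for this sketch).
type UUID [16]byte

// marshalVectorUUIDSketch encodes a UUID vector by bulk-copying each fixed
// 16-byte element into one preallocated buffer: no per-element allocations,
// no endian conversion needed. (Hypothetical name.)
func marshalVectorUUIDSketch(v []UUID) []byte {
	buf := make([]byte, len(v)*16)
	for i, u := range v {
		copy(buf[i*16:(i+1)*16], u[:])
	}
	return buf
}

func main() {
	v := []UUID{{0x01}, {0x02}}
	out := marshalVectorUUIDSketch(v)
	fmt.Println(len(out)) // 32: two 16-byte elements
}
```

In the PR itself this is combined with the pooled buffers, so steady-state marshal of a UUID vector performs no allocation at all.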

Commit 4: 66ed9aa perf: add VectorType.NewWithError() to avoid goType/asVectorType re-parse

VectorType embeds NativeType but had no NewWithError() method. Calls therefore dispatched to NativeType.NewWithError(), which hit the TypeCustom fallback (goType() → asVectorType()), re-parsing the full Java type string on every call. The new method returns a pointer to a concrete slice of the subtype's Go type directly: 10.7x faster (181ns → 17ns), 75% fewer allocs (4 → 1), 74% less memory (92B → 24B).
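The idea can be sketched as follows. All names here are illustrative stand-ins, not the PR's actual API (the real method is VectorType.NewWithError() and covers more subtypes): for known subtypes, hand back a pointer to a concrete slice type directly instead of re-parsing the type string and building the type via reflection.

```go
package main

import "fmt"

// newVectorDest returns a pointer to a fresh element slice for a known
// subtype without any reflection or type-string parsing (fast path).
// Hypothetical name and subtype keys for illustration only.
func newVectorDest(subtype string) (interface{}, error) {
	switch subtype {
	case "float": // vector<float> scans into *[]float32
		return &[]float32{}, nil
	case "double": // vector<double> scans into *[]float64
		return &[]float64{}, nil
	case "int":
		return &[]int32{}, nil
	case "bigint":
		return &[]int64{}, nil
	default:
		return nil, fmt.Errorf("no fast path for subtype %q", subtype)
	}
}

func main() {
	d, _ := newVectorDest("float")
	fmt.Printf("%T\n", d) // *[]float32
}
```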

Headline numbers (vector<float, 1536>, typical embedding dimension)

  • 22x faster marshal (fast paths alone), 41x with pool recycling (see Pooled benchmarks)
  • 36x faster unmarshal, zero allocations steady state
  • 99.93% fewer allocations on marshal (3,074 → 2)
  • Marshal memory: 18,456 B/op → 6,172 B/op (fast paths) → 48 B/op (with pool recycling)
  • Unmarshal memory: 6,168 B/op → 0 B/op
  • 10.7x faster NewWithError() for VectorType (181ns → 17ns)

Benchmark results

All benchmarks: 6 iterations, benchstat, all p=0.002. Machine: 12th Gen Intel Core i7-1270P.

Master = 3881f1e (origin/master), Optimized = 66ed9aa (this branch HEAD).

Latency (ns/op) — Master vs Optimized (all 4 commits)

| Benchmark | Master | Optimized | Speedup |
|---|---:|---:|---:|
| **float32** | | | |
| MarshalVectorFloat32/dim_128 | 4,641 | 285 | 16.3x |
| MarshalVectorFloat32/dim_384 | 16,237 | 699 | 23.2x |
| MarshalVectorFloat32/dim_768 | 26,785 | 1,234 | 21.7x |
| MarshalVectorFloat32/dim_1536 | 54,049 | 2,415 | 22.4x |
| UnmarshalVectorFloat32/dim_128 | 3,116 | 95 | 32.8x |
| UnmarshalVectorFloat32/dim_384 | 10,647 | 280 | 38.0x |
| UnmarshalVectorFloat32/dim_768 | 18,536 | 586 | 31.6x |
| UnmarshalVectorFloat32/dim_1536 | 39,186 | 1,092 | 35.9x |
| **float64** | | | |
| MarshalVectorFloat64/dim_128 | 4,700 | 380 | 12.4x |
| MarshalVectorFloat64/dim_384 | 13,125 | 985 | 13.3x |
| MarshalVectorFloat64/dim_768 | 29,809 | 1,905 | 15.6x |
| MarshalVectorFloat64/dim_1536 | 59,207 | 3,754 | 15.8x |
| **int32** | | | |
| MarshalVectorInt32/dim_128 | 4,599 | 249 | 18.5x |
| MarshalVectorInt32/dim_384 | 13,560 | 654 | 20.7x |
| MarshalVectorInt32/dim_768 | 25,402 | 1,166 | 21.8x |
| MarshalVectorInt32/dim_1536 | 47,432 | 2,262 | 21.0x |
| UnmarshalVectorInt32/dim_128 | 3,201 | 94 | 34.1x |
| UnmarshalVectorInt32/dim_384 | 9,763 | 279 | 35.0x |
| UnmarshalVectorInt32/dim_768 | 19,700 | 547 | 36.0x |
| UnmarshalVectorInt32/dim_1536 | 40,106 | 1,073 | 37.4x |
| **int64** | | | |
| MarshalVectorInt64/dim_128 | 4,643 | 368 | 12.6x |
| MarshalVectorInt64/dim_384 | 13,578 | 952 | 14.3x |
| MarshalVectorInt64/dim_768 | 26,887 | 1,834 | 14.7x |
| MarshalVectorInt64/dim_1536 | 62,726 | 3,636 | 17.3x |
| UnmarshalVectorInt64/dim_128 | 3,387 | 111 | 30.5x |
| UnmarshalVectorInt64/dim_384 | 9,854 | 343 | 28.7x |
| UnmarshalVectorInt64/dim_768 | 21,628 | 601 | 36.0x |
| UnmarshalVectorInt64/dim_1536 | 44,880 | 1,190 | 37.7x |
| **UUID** | | | |
| MarshalVectorUUID/dim_128 | 8,154 | 710 | 11.5x |
| MarshalVectorUUID/dim_384 | 25,949 | 1,994 | 13.0x |
| MarshalVectorUUID/dim_768 | 52,738 | 3,861 | 13.7x |
| MarshalVectorUUID/dim_1536 | 95,785 | 7,480 | 12.8x |
| UnmarshalVectorUUID/dim_128 | 3,688 | 119 | 31.0x |
| UnmarshalVectorUUID/dim_384 | 11,164 | 330 | 33.8x |
| UnmarshalVectorUUID/dim_768 | 22,201 | 651 | 34.1x |
| UnmarshalVectorUUID/dim_1536 | 44,435 | 1,294 | 34.3x |
| **NewWithError / RowData** | | | |
| VectorNewWithError/VectorType | 181 | 17 | 10.7x |
| VectorNewWithError/NativeType_fallback | 183 | 168 | 1.1x |
| RowDataWithVector | 503 | 108 | 4.7x |

Pool wiring benefit (Commit 2) — production write path

The table above measures Marshal()/Unmarshal() via the public API, which does not return buffers to the pool. In production, executeQuery()/executeBatch() return the buffer via putVectorBuf() after the framer copies it. The Pooled benchmarks simulate this:

| Benchmark | Master | Fast-path only | Speedup vs master | + Pool return | Speedup vs master |
|---|---:|---:|---:|---:|---:|
| MarshalFloat32Pooled/dim_128 | 4,641 | 285 | 16.3x | 161 | 28.8x |
| MarshalFloat32Pooled/dim_384 | 16,237 | 699 | 23.2x | 369 | 44.0x |
| MarshalFloat32Pooled/dim_768 | 26,785 | 1,234 | 21.7x | 679 | 39.4x |
| MarshalFloat32Pooled/dim_1536 | 54,049 | 2,415 | 22.4x | 1,306 | 41.4x |
| MarshalInt32Pooled/dim_1536 | 47,432 | 2,262 | 21.0x | 1,100 | 43.1x |
| MarshalInt64Pooled/dim_1536 | 62,726 | 3,636 | 17.3x | 1,299 | 48.3x |
| MarshalUUIDPooled/dim_1536 | 95,785 | 7,480 | 12.8x | 1,440 | 66.5x |

Memory with pool return is 48 B/op constant, regardless of vector dimension or element type (from sync.Pool interface boxing overhead, irreducible). Compare to master: 18,456 B/op for float32/dim_1536, 98,328 B/op for UUID/dim_1536.
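Where that constant overhead comes from can be demonstrated directly: putting a []byte into a sync.Pool boxes the slice header into an interface{}, allocating a small header object even though the backing array itself is reused. This is an illustrative measurement, not the PR's benchmark:

```go
package main

import (
	"fmt"
	"sync"
	"testing"
)

// measureBoxing returns the average allocations per Put/Get cycle on a
// sync.Pool holding a []byte. The backing array is reused, but each Put
// converts the slice to interface{}, which heap-allocates the slice header.
func measureBoxing() float64 {
	var pool sync.Pool
	buf := make([]byte, 6144) // vector<float,1536> wire size
	return testing.AllocsPerRun(1000, func() {
		pool.Put(buf)             // boxes the slice header: one small alloc
		buf = pool.Get().([]byte) // same backing array comes back
	})
}

func main() {
	fmt.Println("allocs per Put/Get cycle:", measureBoxing())
}
```

This boxing cost is what the PR describes as irreducible for a []byte-valued pool; it is independent of the buffer size, which is why the write path flattens to a small constant B/op regardless of vector dimension.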

Full benchstat details: memory and allocations

Memory (B/op)

| Benchmark | Master | Optimized | Change |
|---|---:|---:|---:|
| RowDataWithVector | 216 | 144 | -33.33% |
| UnmarshalVectorFloat32/dim_128 | 536 | 0 | -100.00% |
| UnmarshalVectorFloat32/dim_384 | 1,560 | 0 | -100.00% |
| UnmarshalVectorFloat32/dim_768 | 3,096 | 0 | -100.00% |
| UnmarshalVectorFloat32/dim_1536 | 6,168 | 0 | -100.00% |
| MarshalVectorFloat32/dim_128 | 1,560 | 536 | -65.64% |
| MarshalVectorFloat32/dim_384 | 4,632 | 1,561 | -66.30% |
| MarshalVectorFloat32/dim_768 | 9,240 | 3,098 | -66.47% |
| MarshalVectorFloat32/dim_1536 | 18,456 | 6,172 | -66.55% |
| MarshalVectorFloat64/dim_128 | 3,096 | 1,048 | -66.15% |
| MarshalVectorFloat64/dim_384 | 9,240 | 3,098 | -66.47% |
| MarshalVectorFloat64/dim_768 | 18,456 | 6,173 | -66.55% |
| MarshalVectorFloat64/dim_1536 | 36,888 | 12,319 | -66.60% |
| MarshalVectorInt32/dim_128 | 1,560 | 536 | -65.64% |
| MarshalVectorInt32/dim_384 | 4,632 | 1,561 | -66.30% |
| MarshalVectorInt32/dim_768 | 9,240 | 3,098 | -66.47% |
| MarshalVectorInt32/dim_1536 | 18,456 | 6,172 | -66.55% |
| MarshalVectorInt64/dim_128 | 3,096 | 1,048 | -66.15% |
| MarshalVectorInt64/dim_384 | 9,240 | 3,098 | -66.47% |
| MarshalVectorInt64/dim_768 | 18,456 | 6,173 | -66.55% |
| MarshalVectorInt64/dim_1536 | 36,888 | 12,319 | -66.60% |
| MarshalVectorUUID/dim_128 | 8,216 | 2,072 | -74.77% |
| MarshalVectorUUID/dim_384 | 24,600 | 6,172 | -74.91% |
| MarshalVectorUUID/dim_768 | 49,176 | 12,312 | -74.94% |
| MarshalVectorUUID/dim_1536 | 98,328 | 24,600 | -74.96% |
| UnmarshalVectorInt32 (all dims) | 536–6,168 | 0 | -100.00% |
| UnmarshalVectorInt64 (all dims) | 1,048–12,312 | 0 | -100.00% |
| UnmarshalVectorUUID (all dims) | 2,072–24,600 | 0 | -100.00% |
| VectorNewWithError/VectorType | 92 | 24 | -73.91% |

Allocations (allocs/op)

| Benchmark | Master | Optimized | Change |
|---|---:|---:|---:|
| RowDataWithVector | 8 | 5 | -37.50% |
| MarshalVectorFloat32/dim_128 | 258 | 2 | -99.22% |
| MarshalVectorFloat32/dim_384 | 770 | 2 | -99.74% |
| MarshalVectorFloat32/dim_768 | 1,538 | 2 | -99.87% |
| MarshalVectorFloat32/dim_1536 | 3,074 | 2 | -99.93% |
| MarshalVectorFloat64 (all dims) | 258–3,074 | 2 | -99.22% to -99.93% |
| MarshalVectorInt32 (all dims) | 258–3,074 | 2 | -99.22% to -99.93% |
| MarshalVectorInt64 (all dims) | 258–3,074 | 2 | -99.22% to -99.93% |
| MarshalVectorUUID/dim_128 | 386 | 2 | -99.48% |
| MarshalVectorUUID/dim_384 | 1,154 | 2 | -99.83% |
| MarshalVectorUUID/dim_768 | 2,306 | 2 | -99.91% |
| MarshalVectorUUID/dim_1536 | 4,610 | 2 | -99.96% |
| All unmarshal (all types, all dims) | 2 | 0 | -100.00% |
| VectorNewWithError/VectorType | 4 | 1 | -75.00% |

Pooled marshal memory (B/op) — with pool return

| Benchmark | B/op |
|---|---:|
| MarshalFloat32Pooled (all dims) | 48 |
| MarshalInt32Pooled (all dims) | 48 |
| MarshalInt64Pooled (all dims) | 48 |
| MarshalUUIDPooled (all dims) | 48 |

What changed

marshal.go

  1. Fast-path type switches in marshalVector() and unmarshalVector() — before the existing reflect-based generic path, a switch on info.SubType.Type() intercepts []float32, []float64, []int32, []int64, and []UUID and dispatches to 10 dedicated functions. All other types fall through to the generic path.

  2. 10 new marshal/unmarshal functions — marshalVectorFloat32/Float64/Int32/Int64/UUID and the corresponding unmarshalVector* functions. Float/int variants use encoding/binary.BigEndian.PutUint32/PutUint64 with math.Float32bits/Float64bits. UUID variants use bulk copy() of 16-byte elements.

  3. sync.Pool buffer reuse — vectorBufPool, getVectorBuf(size), and putVectorBuf(buf) with a 64 KiB cap guard.

  4. Unmarshal slice reuse — all unmarshal fast paths reuse the destination slice's backing array when capacity is sufficient, achieving zero allocations on repeated reads.

  5. Generic path preallocation — vectorFixedElemSize() returns the wire-format byte size for fixed-length CQL types; the generic path calls buf.Grow() upfront.

  6. VectorType.NewWithError() — returns a pointer to a concrete slice of the subtype's Go type directly, without going through goType() → asVectorType(). 10.7x faster, 75% fewer allocs.
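The pooling helpers (item 3) can be sketched as follows. This is a minimal sketch following the description above (getVectorBuf/putVectorBuf with a 64 KiB cap guard); the PR's actual implementation may differ in detail:

```go
package main

import (
	"fmt"
	"sync"
)

var vectorBufPool sync.Pool

const maxPooledVectorBuf = 64 * 1024 // cap guard: oversized buffers are not pooled

// getVectorBuf returns a buffer of exactly size bytes, reusing a pooled
// backing array when its capacity suffices. Note: size 0 yields a non-nil
// empty slice, so empty vectors are not confused with CQL NULL.
func getVectorBuf(size int) []byte {
	if v := vectorBufPool.Get(); v != nil {
		if b := v.([]byte); cap(b) >= size {
			return b[:size]
		}
	}
	return make([]byte, size)
}

// putVectorBuf returns a buffer to the pool after the caller is done with it
// (i.e., after the framer has copied the bytes). Zero-capacity and oversized
// buffers are dropped to keep the pool from bloating.
func putVectorBuf(buf []byte) {
	if cap(buf) == 0 || cap(buf) > maxPooledVectorBuf {
		return
	}
	vectorBufPool.Put(buf[:0])
}

func main() {
	b := getVectorBuf(6144) // vector<float,1536> wire size
	fmt.Println(len(b))
	putVectorBuf(b)
}
```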

conn.go

  1. Pool return wiring — executeQuery() and executeBatch() call defer putVectorBuf(...) on each queryValues.value, closing the pool lifecycle.
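The wiring has roughly this shape. Heavily simplified sketch: sendQuery, isPooled, and put are illustrative stand-ins for the PR's actual per-column logic in executeQuery/executeBatch:

```go
package main

import "fmt"

// sendQuery marshals and writes query values, then returns pooled vector
// buffers via a deferred loop. Returning happens only after the framer has
// copied the bytes, so the buffers are safe to reuse.
func sendQuery(values [][]byte, isPooled []bool, put func([]byte)) {
	defer func() {
		for i, v := range values {
			if isPooled[i] {
				put(v) // back to the pool for the next marshal
			}
		}
	}()
	// ... framer writes/copies the values here ...
	fmt.Println("sent", len(values), "values")
}

func main() {
	returned := 0
	sendQuery([][]byte{{1}, {2}}, []bool{true, false}, func([]byte) { returned++ })
	fmt.Println("returned to pool:", returned)
}
```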

Test files

  • marshal_vector_test.go — 58 unit subtests across 13 categories, including UUID-specific tests
  • vector_bench_test.go — benchmarks for all 5 types (marshal, pooled marshal, unmarshal) across dimensions 128/384/768/1536
  • marshal_test.go — TestVectorNewWithErrorConsistentWithGoType, TestVectorNewWithErrorReturnsSlicePointer
  • helpers_bench_test.go — BenchmarkVectorNewWithError, BenchmarkRowDataWithVector

How the bottleneck was eliminated

The original generic path for vector<float, 1536>:

  1. Called Marshal() 1,536 times through reflect dispatch
  2. Each call allocated a 4-byte []byte via encFloat32 (1,536 allocs)
  3. Appended each to a bytes.Buffer that grew incrementally (additional allocs)
  4. The buffer was then copied into framer.buf by writeBytes()

The fast path:

  1. Single getVectorBuf(6144) (pooled, zero-alloc steady state)
  2. Tight loop: binary.BigEndian.PutUint32(buf[i*4:], math.Float32bits(v))
  3. No reflect, no per-element dispatch, no intermediate allocations
  4. putVectorBuf() returns the buffer to the pool after c.exec()
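The fast-path loop can be sketched end to end (pooling elided; function names here are illustrative, not the PR's exact functions):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// encodeFloat32Vector writes each element as a big-endian 4-byte word into a
// single preallocated buffer: no reflect, no per-element allocation.
func encodeFloat32Vector(v []float32) []byte {
	buf := make([]byte, len(v)*4)
	for i, f := range v {
		binary.BigEndian.PutUint32(buf[i*4:], math.Float32bits(f))
	}
	return buf
}

// decodeFloat32Vector reuses dst's backing array when capacity suffices,
// giving zero allocations on repeated reads (the slice-reuse optimization).
func decodeFloat32Vector(data []byte, dst []float32) []float32 {
	n := len(data) / 4
	if cap(dst) < n {
		dst = make([]float32, n)
	}
	dst = dst[:n]
	for i := range dst {
		dst[i] = math.Float32frombits(binary.BigEndian.Uint32(data[i*4:]))
	}
	return dst
}

func main() {
	in := []float32{1.5, -2.25, 0}
	fmt.Println(decodeFloat32Vector(encodeFloat32Vector(in), nil))
}
```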

Design decisions

  • Phase 3 (write-through to framer) was deliberately skipped — encoding directly into framer.buf would save ~0.4 µs (~21% marginal over the pooled path) but requires invasive changes to the queryValues struct used by all query paths. The risk/reward ratio is unfavorable.
  • Phase 4 (fix isVectorVariableLengthType) was deliberately skipped — there is a discrepancy in how some types are handled between Cassandra and ScyllaDB implementations. We focus only on the 5 types where the wire format is unambiguous.

Relationship to open PRs

Replaces PR #744 (float fast paths) and PR #745 (generic prealloc)

This PR is a strict superset of both:

| Feature | PR #744 | PR #745 | This PR |
|---|:---:|:---:|:---:|
| Float32/Float64 fast marshal | yes | | yes |
| Float32/Float64 fast unmarshal + slice reuse | yes | | yes |
| Int32/Int64 fast marshal/unmarshal | | | yes |
| UUID/TimeUUID fast marshal/unmarshal | | | yes |
| sync.Pool buffer reuse | | | yes |
| Pool wiring in conn.go | | | yes |
| VectorType.NewWithError() | | | yes |
| vectorFixedElemSize() helper | | yes | yes |
| Generic buf.Grow() prealloc | | yes | yes |

If this PR merges first, #744 and #745 become no-ops and should be closed.

Orthogonal to PRs #751, #752, #753

These PRs reduce per-request allocations in the connection/framing layer. This PR reduces allocations in the marshal/unmarshal layer. They are fully complementary — different files, different allocation sites, additive benefits.

Note: PR #749 (pool write-side framers) was closed — fully superseded by 3e1e7e4 on master.

Depends on PR #838

PR #838 fixes a pre-existing build failure in session_unit_test.go where hostId changed from string to UUID but test literals were not updated. This branch carries the same fix; once #838 merges, the fix becomes a no-op on rebase.

@mykaul mykaul marked this pull request as draft March 13, 2026 11:52
@mykaul mykaul requested a review from Copilot March 13, 2026 11:52

Copilot AI left a comment


Pull request overview

This PR introduces high-performance, type-specialized marshal/unmarshal paths for common numeric vector element types to reduce reflect overhead and allocations in the gocql CQL codec layer.

Changes:

  • Added fast paths in marshalVector / unmarshalVector for []float32, []float64, []int32, []int64 plus a sync.Pool-backed buffer helper for marshal-side reuse.
  • Added a generic-path preallocation helper (vectorFixedElemSize + buf.Grow) to reduce bytes.Buffer growth for fixed-size element types.
  • Added extensive unit tests for vector behavior and expanded internal/public benchmarks for new int32/int64 vector cases and pooled scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, vector buffer pool helpers, and generic vector preallocation support. |
| marshal_vector_test.go | New comprehensive unit tests for vector fast paths, pooling helpers, and preallocation behavior. |
| vector_bench_test.go | Adds/extends internal benchmarks for pooled marshal and int32/int64 vector performance. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vector marshal/unmarshal via gocql.Marshal/Unmarshal. |


Copilot AI left a comment


Pull request overview

Introduces type-specialized fast paths for vector marshal/unmarshal (float32/float64/int32/int64) in marshal.go to avoid reflect-heavy per-element encoding, plus expanded benchmarks and a new, comprehensive unit-test suite to validate correctness and performance characteristics.

Changes:

  • Add fast-path vector marshal/unmarshal implementations using encoding/binary bulk conversion and destination-slice reuse.
  • Add sync.Pool-backed buffer helpers (getVectorBuf/putVectorBuf) and generic-path preallocation via vectorFixedElemSize.
  • Expand internal and public benchmarks; add a large new unit test file covering many vector edge cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| marshal.go | Adds specialized vector marshal/unmarshal fast paths, pooling helpers, and generic-path preallocation. |
| marshal_vector_test.go | New unit tests for vector behavior (round-trip, byte-compat, slice reuse, pool behavior, etc.). |
| vector_bench_test.go | Adds pooled/write-path/round-trip benchmarks plus int32/int64 benchmark coverage. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |


Copilot AI left a comment


Pull request overview

This PR accelerates CQL vector encoding/decoding in the GoCQL driver by introducing type-specialized marshal/unmarshal fast paths for common numeric vector element types, reducing reflection overhead and allocations on hot paths.

Changes:

  • Add specialized marshal/unmarshal implementations for []float32, []float64, []int32, []int64 using encoding/binary + bit conversions, with unmarshal slice reuse.
  • Introduce vectorBufPool (sync.Pool) helpers for reusable marshal buffers and add generic-path preallocation via vectorFixedElemSize() + bytes.Buffer.Grow().
  • Add extensive unit tests for vector behavior and expand internal + public benchmarks for the new fast paths.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, buffer pooling helpers, and generic-path preallocation helper. |
| marshal_vector_test.go | New unit test suite covering round-trip, compatibility, reuse, pool behavior, and prealloc. |
| vector_bench_test.go | Adds pooled/unpooled benchmarks and a simulated write-path benchmark for vectors. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |


Copilot AI left a comment


Pull request overview

This PR adds specialized, non-reflect fast paths for marshaling/unmarshaling common vector<> element types in the GoCQL driver to significantly reduce allocations and improve throughput, along with extensive tests and expanded benchmarks.

Changes:

  • Add type-specialized vector marshal/unmarshal implementations for []float32, []float64, []int32, and []int64, plus a pooled []byte buffer facility for marshal fast paths.
  • Improve generic vector marshal performance via preallocation (buf.Grow) when element wire size is known.
  • Add a comprehensive unit test suite for vector behavior and extend internal + public benchmarks for the new int32/int64 paths and pooled scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.

| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, buffer pooling helpers, overflow guards, and generic-path preallocation. |
| marshal_vector_test.go | New comprehensive unit tests covering fast paths, edge cases, and pool behavior. |
| vector_bench_test.go | Adds pooled/write-path/round-trip benchmarks and int32/int64 benchmarks. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |


Copilot AI left a comment


Pull request overview

This PR introduces type-specialized fast paths for marshaling/unmarshaling common fixed-width vector element types (float32/float64/int32/int64) to avoid reflect-based per-element work, reducing allocations and significantly improving throughput in the driver’s vector serialization layer.

Changes:

  • Added fast-path dispatch in marshalVector/unmarshalVector with dedicated bulk encode/decode implementations for float32/float64/int32/int64 vectors.
  • Introduced a sync.Pool-backed byte buffer reuse mechanism for vector marshaling and slice-backing reuse for unmarshaling.
  • Added extensive unit tests plus expanded internal/public benchmarks for the new fast paths and pooled usage patterns.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
|---|---|
| marshal.go | Adds fast-path vector marshal/unmarshal implementations, buffer pooling helpers, and generic-path preallocation. |
| marshal_vector_test.go | Adds a comprehensive unit test suite for vector behavior, pooling, edge cases, and compatibility. |
| vector_bench_test.go | Adds/extends internal benchmarks for pooled marshal, write-path simulation, and int vector types. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vector marshal/unmarshal. |

@mykaul mykaul force-pushed the vector-perf-optimize branch from 9e5efc1 to 234f670 Compare March 17, 2026 18:04

mykaul commented Mar 17, 2026

Addressed review feedback:

Fixed (this push):

  1. getVectorBuf(0) now returns non-nil empty slice instead of nil. Previously, marshaling a non-nil empty vector ([]float32{}) with dim==0 would return nil, which framer.writeBytes encodes as CQL NULL. Now it correctly returns make([]byte, 0), distinguishing empty vectors from NULL.
  2. Added dim==0 array validation: When Dimensions==0 and the destination is *[N]T where N!=0, we now return an error ("array of size N cannot store vector of 0 dimensions") instead of silently succeeding and leaving the array unchanged.
  3. Strengthened empty-vector tests: All 4 TestMarshalVector_EmptyVector subtests now assert data != nil in addition to len(data) == 0, catching the nil-vs-empty distinction.
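The nil-vs-empty distinction these fixes rely on can be shown in isolation. A minimal sketch: per the description above, framer.writeBytes encodes a nil value as CQL NULL (length -1) while a non-nil zero-length slice encodes as a present, empty value (length 0):

```go
package main

import "fmt"

// encodeLength models the wire-level distinction: nil -> NULL marker (-1),
// non-nil empty slice -> zero-length value. (Illustrative helper, not the
// driver's actual writeBytes.)
func encodeLength(b []byte) int32 {
	if b == nil {
		return -1 // CQL NULL
	}
	return int32(len(b)) // present value, possibly empty
}

func main() {
	fmt.Println(encodeLength(nil))             // -1
	fmt.Println(encodeLength(make([]byte, 0))) // 0
}
```

This is exactly why getVectorBuf(0) must return make([]byte, 0) rather than nil: both have length zero, but only one survives as "empty vector" on the wire.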

Already fixed in prior revisions:

  • Magic number 0x0015 → uint16(TypeDuration) (already done)
  • Displaced isVectorVariableLengthType doc comment (already adjacent to function)
  • -0 sign bit preservation tests (already have explicit Float32bits/Float64bits checks)
  • dim==0 unmarshal fast-paths already return non-nil empty slices via make([]float32, 0) etc.

Not a bug (Copilot false positives):

  • &result[:1][0] on make([]float32, 0, dim+10): This does NOT panic. Go allows reslicing up to capacity, so result[:1] is valid when cap(result) >= 1.
  • dim * 4 overflow: Already handled by vectorByteSize() which uses int64 arithmetic and checks for overflow.
  • vectorByteSize returning fmt.Errorf: All callers already wrap the error with marshalErrorf/unmarshalErrorf, so the final error type is correct.
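The reslice-to-capacity point is easy to verify in a few lines:

```go
package main

import "fmt"

// firstElemPtr shows that result[:1] is legal when cap(result) >= 1 even
// though len(result) == 0: reslicing may extend up to capacity, and indexing
// a slice expression yields an addressable element.
func firstElemPtr() float32 {
	result := make([]float32, 0, 10)
	p := &result[:1][0] // no panic: within capacity
	*p = 3.5
	return result[:1][0]
}

func main() {
	fmt.Println(firstElemPtr()) // 3.5
}
```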

Not changing (style preference):

  • readColWithSpec high-arity signature: Internal function with 2 call sites in the same file. A struct would add indirection without clear benefit.
  • Missing fixed-size types in vectorFixedElemSize: The function covers the types that have vector fast-paths. Other types fall through to the generic path which handles them correctly.


Copilot AI left a comment


Pull request overview

Adds type-specialized marshal/unmarshal fast paths for common vector<...> element types to significantly reduce reflection overhead and allocations in the driver’s value encoding/decoding layer.

Changes:

  • Introduces specialized marshal/unmarshal implementations for []float32, []float64, []int32, []int64 plus a sync.Pool-backed buffer helper for marshal fast paths.
  • Adds generic-path preallocation for fixed-size vector element types and improves 0-dimension handling in unmarshal.
  • Expands benchmarks and adds a comprehensive new unit test suite for vector behavior/performance characteristics.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, pooled buffer helpers, fixed-element-size prealloc, and 0-dimension unmarshal handling. |
| marshal_vector_test.go | New, extensive unit tests for vector marshal/unmarshal correctness, edge cases, and pooling behavior. |
| vector_bench_test.go | Extends internal benchmarks to cover pooled write-path simulations and int32/int64 vectors. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to include int32/int64 vector marshal/unmarshal. |

@mykaul mykaul force-pushed the vector-perf-optimize branch from b0a2d82 to 118e06c Compare March 24, 2026 18:21
@mykaul mykaul force-pushed the vector-perf-optimize branch 2 times, most recently from 432f624 to 9410b42 Compare April 4, 2026 11:54
dkropachev (Collaborator) commented:

@mykaul, it is a great idea to use pooled buffers, but I think we need to make it generic so that it works the same way for every data type; I don't see any point in targeting vectors specifically.

@mykaul mykaul changed the title perf: optimize vector marshal/unmarshal for float32/float64/int32/int64 perf: optimize vector marshal/unmarshal for float32/float64/int32/int64 (Throughput: 75 MiB/s → 1.9 GiB/s (marshal), 106 MiB/s → 4.6 GiB/s (unmarshal) ) Apr 7, 2026
Add type-specialized fast paths for vector<float>, vector<double>,
vector<int>, and vector<bigint> that bypass reflect-based per-element
marshaling in favor of direct encoding/binary bulk conversion.

Changes in marshal.go:
- Type switches in marshalVector()/unmarshalVector() dispatch to
  dedicated functions for []float32, []float64, []int32, []int64
  before falling through to the generic reflect path.
- 8 new functions: marshalVectorFloat32, marshalVectorFloat64,
  unmarshalVectorFloat32, unmarshalVectorFloat64, marshalVectorInt32,
  marshalVectorInt64, unmarshalVectorInt32, unmarshalVectorInt64.
- sync.Pool buffer reuse (vectorBufPool/getVectorBuf/putVectorBuf)
  for zero-alloc steady state when callers return buffers after
  the framer copies them. 64KiB cap prevents pool bloat.
- Unmarshal fast paths reuse destination slice backing array when
  capacity is sufficient (zero-alloc steady state on read path).
- Generic path preallocation via vectorFixedElemSize() + buf.Grow()
  for non-fast-path fixed-size types (e.g. UUID, timestamp).
- vectorByteSize() helper guards against integer overflow on 32-bit
  platforms with corrupt or adversarial schema metadata.
- All fast-path errors are wrapped as MarshalError/UnmarshalError
  for consistent error typing.
- dim=0 vectors correctly encode as non-nil empty values (not CQL NULL)
  in both fast paths and generic path.
- Negative dimensions are rejected with clear error messages.

Benchmark results for vector<float, 1536> (typical embedding dimension):

  Marshal (baseline -> optimized):
    86.4 us/op  ->  3.4 us/op  (25x faster)
    3081 allocs ->  2 allocs    (99.94% fewer)
    28632 B/op  ->  6172 B/op   (78% less memory)

  Marshal with pool return (steady state):
    86.4 us/op  ->  1.6 us/op  (54x faster)
    3081 allocs ->  2 allocs    (99.94% fewer)
    28632 B/op  ->  48 B/op     (99.8% less memory)

  Unmarshal (baseline -> optimized):
    60.2 us/op  ->  1.5 us/op  (41x faster)
    2 allocs    ->  0 allocs    (100% fewer)
    6168 B/op   ->  0 B/op      (100% less memory)

  Round-trip (baseline -> optimized, pooled):
    147.8 us/op ->  3.1 us/op  (48x faster)
    3083 allocs ->  2 allocs    (99.94% fewer)
    34800 B/op  ->  48 B/op     (99.9% less memory)

  Throughput: 80 MB/s -> 3.5 GB/s (geomean, +2900%)

New test files:
- marshal_vector_test.go: 58+ unit subtests across 13 categories
  (round-trip, byte-compat, slice-reuse, nil, dimension-mismatch,
  empty-vector, pointer-to-slice, special-values, pool-concurrency,
  oversized-not-pooled, fixed-elem-size, generic-prealloc).
- vector_bench_test.go: extended with int32/int64 and pooled benchmarks.
- tests/bench/bench_vector_public_test.go: public API benchmarks for
  int32/int64 marshal/unmarshal.

Subsumes PR scylladb#744 (float fast paths) and PR scylladb#745 (generic prealloc).
Extends with int32/int64 fast paths and buffer pooling not covered by
any existing PR.

mykaul commented Apr 10, 2026

@mykaul, it is a great idea to use pooled buffers, but I think we need to make it generic so that it works the same way for every data type; I don't see any point in targeting vectors specifically.

@dkropachev - I targeted vectors precisely because they are large. I can have a pool per type, or do you prefer one general pool for all types?

@mykaul mykaul force-pushed the vector-perf-optimize branch from 5e62e7c to ea4e0d7 Compare April 10, 2026 13:49
mykaul added 3 commits April 10, 2026 17:37
Return pooled vector buffers to vectorBufPool after the framer copies
marshalled bytes in executeQuery and executeBatch. This completes the
zero-alloc steady-state cycle for vector marshal operations.

In executeQuery, a defer after the marshal loop returns buffers for
columns identified as pooled vector types (float32, float64, int32,
int64). In executeBatch, vector buffers are collected across all batch
statements and returned via a single defer.

The vectorBufPoolSubtype helper centralizes the type check to keep
the two call sites consistent with the marshal fast paths.

Includes unit tests covering vectorBufPoolSubtype classification,
single-query and batch pool return simulation, and non-pooled type
safety.
Add dedicated marshal/unmarshal fast paths for UUID and TimeUUID vector
elements, following the same pattern as the existing float32/float64/
int32/int64 fast paths.

UUID is [16]byte with no endian conversion needed, so the fast path
uses a simple copy() loop. Uses pooled buffers via getVectorBuf for
zero-alloc steady state on the marshal path, and reuses the destination
slice backing array on the unmarshal path.

Benchmarks (vs generic reflection path):
- Marshal: ~90% faster (10x speedup), 99%+ fewer allocations
- Unmarshal: ~97% faster (30-35x speedup), zero allocations
- Marshal+pool: additional 4x over non-pooled marshal
perf: add VectorType.NewWithError() to avoid goType/asVectorType re-parse

VectorType embeds NativeType but had no NewWithError() method, so calls
fell through to NativeType.NewWithError() which hit the TypeCustom
fallback: goType() → asVectorType() → re-parse the full Java type string
(e.g. 'org.apache.cassandra.db.marshal.VectorType(FloatType, 1536)')
on every invocation. This is called per-column per-row by RowData() and
MapScan, making it a hot path for vector workloads.

Add VectorType.NewWithError() with fast paths for all common element
types (float32, float64, int32, int64, UUID, string, bool, etc.) that
return *[]T directly without reflection or string parsing. Fallback for
exotic subtypes still uses SubType.NewWithError() + reflect.SliceOf but
avoids the asVectorType() re-parse.

Also fix zero-dimension error messages in fast-path unmarshal functions
to be consistent with the generic path (check dim==0 before byte-size
validation), fix copyright header in marshal_vector_test.go, and fix
pre-existing session_unit_test.go build error from origin/master
(hostId string → UUID type mismatch).

Benchmark results (VectorType.NewWithError vs NativeType fallback):

  VectorType:          ~17 ns/op, 24 B/op, 1 allocs/op
  NativeType_fallback: ~170 ns/op, 92 B/op, 4 allocs/op

  → 10x faster, 75% fewer allocations
@mykaul mykaul force-pushed the vector-perf-optimize branch from 81dae0b to 66ed9aa Compare April 10, 2026 14:38

mykaul commented Apr 10, 2026

Analysis: Generalizing the Pooled-Buffer / Fast-Path Concept

Following up on @dkropachev's comment:

"it is great idea to use pooled buffers, but I think we need to make it generic so that it works for every data type the same way, and don't see any point in targeting vectors specifically."

Here is a detailed analysis of whether, how, and to what extent the two core optimizations in this PR can be generalized.


Two Separable Concepts

The PR contains two distinct optimizations that should be evaluated separately:

| Concept | What it does | Generalizable? |
|---|---|---|
| A. sync.Pool buffer reuse | Reuse marshal output []byte buffers instead of allocating fresh ones per Marshal() call | Yes — universally |
| B. Type-specialized fast paths | Bypass per-element reflect + Marshal() dispatch for known homogeneous fixed-size collections | Partially — applies to list<T>, set<T>, vector<T>, and partially map<K,V> |

Why Generalization is Justified

Every collection marshal function (marshalList at marshal.go:718, marshalMap at marshal.go:1072, marshalVector at marshal.go:866) shares the identical overhead pattern:

  1. Allocates a fresh bytes.Buffer{} per call (no pooling)
  2. Calls Marshal(subType, rv.Index(i).Interface()) per element — full type-switch dispatch + reflect.ValueOf + interface boxing + per-element []byte allocation
  3. Each element's intermediate []byte (e.g., 4 bytes for int, 8 bytes for bigint) is immediately dead after being copied into the buffer

For a list<int> of 1000 elements, this means: 1 bytes.Buffer struct allocation + 1000 reflection operations + 1000 interface boxing operations + 1000 × 4-byte []byte allocations. The exact same waste that vectors have.
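The bulk-conversion alternative for the element payload can be sketched as follows; marshalInt32SliceFast is a hypothetical name, and the CQL collection framing (element count and per-element length prefixes) is omitted for brevity:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// marshalInt32SliceFast serializes []int32 in one pass with encoding/binary:
// a single output allocation, no reflect.Value.Index(i).Interface(), and no
// per-element Marshal dispatch or intermediate 4-byte []byte allocations.
func marshalInt32SliceFast(vals []int32) []byte {
	out := make([]byte, 0, 4*len(vals)) // one allocation for the whole payload
	for _, v := range vals {
		out = binary.BigEndian.AppendUint32(out, uint32(v))
	}
	return out
}

func main() {
	fmt.Printf("% x\n", marshalInt32SliceFast([]int32{1, 256}))
	// 00 00 00 01 00 00 01 00
}
```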

marshalTuple (marshal.go:1297) and marshalUDT (marshal.go:1530) use var buf []byte + append instead of bytes.Buffer, but still have per-element Marshal() dispatch overhead.


Value Analysis

Concept A (Buffer Pool) — Impact by Type

| Type Category | Pool Value | Rationale |
|---|---|---|
| list/set | HIGH | bytes.Buffer alloc eliminated; very common type |
| map | HIGH | Same, plus 2× element overhead (key+value) |
| tuple / UDT | MODERATE | []byte growth via append; pooling helps pre-sizing |
| vector | HIGH | Same (this PR's current target) |
| Scalars (standalone) | LOW | 4-8 byte allocations too small for pool overhead to help individually |

Concept B (Fast Paths) — Projected Speedup

Based on this PR's measured vector results and architectural similarity:

| Workload | Current (est.) | With Pool + Fast Path | Improvement |
|---|---|---|---|
| vector<float, 1536> marshal | ~54 µs | ~1.3 µs | ~41× (proven) |
| list<int> marshal, 1000 elems | ~47 µs | ~2-3 µs | ~15-25× (projected) |
| list<float> unmarshal, 1000 elems | ~39 µs | ~1-2 µs | ~20-35× (projected) |
| map<text,int> marshal, 100 entries | ~15 µs | ~5-8 µs | ~2-3× (text keys limit gains) |
| Scalar int marshal (standalone) | ~50 ns | ~50 ns | No change |

Risk Assessment

| Risk | Severity | Probability | Mitigation |
|---|---|---|---|
| Data aliasing — pooled buffer reused while still referenced | Critical | Low | Framer copies via append(f.buf, p...) before return. Verify no path holds a reference past putBuf. |
| Pool leak — buffers not returned, growing GC pressure | High | Medium | defer pattern; cover all exit paths in conn.go |
| Correctness regression in fast paths | High | Low | PR #770's 58-subtest rigor (incl. -0, NaN) must extend to list/map fast paths. Byte-identical output vs. the reflect path is the critical invariant. |
| Wire format divergence (Cassandra vs ScyllaDB) | Medium | Low | For list/set/map, the wire format is well-standardized. Lower risk than vector. |
| Increased code complexity | Medium | Certain | Go generics (available: project targets Go 1.25) can reduce the O(types²) explosion — the codebase already uses generics in internal/lru and internal/eventbus, and has 42 TODO comments across serialization/ packages noting "when generic-based serialization is introduced" |

Complexity & Implementation Plan

Recommended: two-phase approach.

Phase 1: Generalized Buffer Pool (Low Risk, High Value) — ~100-150 LOC

  1. Replace vectorBufPool with a general-purpose marshalBufPool (sync.Pool of *bytes.Buffer or size-bucketed []byte pools matching the existing queryValuesPools pattern in frame.go:1227)
  2. Modify marshalList, marshalMap, marshalVector to use getMarshalBuf() instead of &bytes.Buffer{}
  3. Add buf.Grow() pre-sizing to marshalList and marshalMap (vector already has this via vectorFixedElemSize)
  4. Generalize putVectorBuf wiring in conn.go to return all marshaled queryValues.value buffers to the pool
  5. Rename vectorFixedElemSize → fixedElemSize (it's not vector-specific) and add missing types (TypeCounter, TypeSmallInt, TypeTinyInt, TypeBoolean)
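A minimal sketch of what the generalized pool could look like; getMarshalBuf, putMarshalBuf, and the 1 MiB cap guard are assumptions for illustration, not the driver's existing API:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// marshalBufPool is a hypothetical general-purpose replacement for the
// vector-specific pool: all collection marshal functions borrow a buffer
// here instead of allocating a fresh bytes.Buffer per call.
var marshalBufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func getMarshalBuf() *bytes.Buffer {
	return marshalBufPool.Get().(*bytes.Buffer)
}

// putMarshalBuf resets and returns a buffer to the pool; oversized buffers
// are dropped so one huge value cannot pin memory indefinitely.
func putMarshalBuf(buf *bytes.Buffer) {
	const maxPooledCap = 1 << 20 // assumed 1 MiB cap guard
	if buf.Cap() > maxPooledCap {
		return
	}
	buf.Reset()
	marshalBufPool.Put(buf)
}

func main() {
	b := getMarshalBuf()
	b.Grow(64) // pre-size, mirroring step 3
	b.WriteString("payload")
	fmt.Println(b.Len()) // 7
	putMarshalBuf(b)
}
```

The defer pattern from the risk table applies at call sites: `buf := getMarshalBuf(); defer putMarshalBuf(buf)` covers every exit path.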

Phase 2: Generalized Fast Paths with Go Generics (Medium Risk, High Value) — ~300-400 LOC

  1. Define generic marshalCollectionFixed[T float32|float64|int32|int64](...) that bulk-serializes []T using encoding/binary without per-element reflection
  2. Add type-switch fast paths in marshalList/marshalSet before the reflect fallback
  3. For marshalMap, add fast paths for common key-value pairs (map[string]int32, etc.)
  4. Port vector fast paths from this PR to use the same generic infrastructure
  5. Keep VectorType.NewWithError() as-is (vector-specific, no generalization needed)
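The type-switch dispatch from step 2 could look roughly like this; tryMarshalListFast is a hypothetical name, and the CQL framing bytes (count and per-element length prefixes) are again omitted:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// tryMarshalListFast is a sketch of the fast-path dispatch: it reports
// whether a bulk fast path applied, and callers fall back to the existing
// reflect-based marshalList when it returns false.
func tryMarshalListFast(v any) ([]byte, bool) {
	switch s := v.(type) {
	case []int32:
		out := make([]byte, 0, 4*len(s))
		for _, e := range s {
			out = binary.BigEndian.AppendUint32(out, uint32(e))
		}
		return out, true
	case []float32:
		out := make([]byte, 0, 4*len(s))
		for _, e := range s {
			out = binary.BigEndian.AppendUint32(out, math.Float32bits(e))
		}
		return out, true
	}
	// Exotic or named types (e.g. type MyFloat float32) take the reflect path.
	return nil, false
}

func main() {
	if b, ok := tryMarshalListFast([]int32{7}); ok {
		fmt.Printf("% x\n", b) // 00 00 00 07
	}
}
```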

What should NOT be generalized:

  • VectorType.NewWithError() — only vectors have the goType()→asVectorType() re-parse overhead
  • Write-through to framer (this PR's "Phase 3" design note) — too invasive, ~20% marginal gain
  • Standalone scalar marshal pooling — 4-8 byte allocs are too small; the fast paths eliminate them when they're inside collection loops anyway

Test Coverage Requirements

| Category | Priority |
|---|---|
| Round-trip correctness: fast-path output → unmarshal → deep-equal | Critical |
| Byte compatibility: fast-path output byte-identical to reflect-path | Critical |
| Edge cases: nil, empty, single element, -0, NaN, MaxFloat, MinInt | Critical |
| Pool lifecycle: no aliasing, cap guard, concurrent safety (race detector) | High |
| Benchmarks: list<int>/list<float>/map<text,int> at sizes 10/100/1000/10000 | High |
| Fallback correctness: non-fast-path types still work identically | Medium |
| Custom named types (type MyFloat float32) fall through to reflect path | Medium |

Conclusion

@dkropachev is right that the buffer pooling concept should be generalized. The marshal lifecycle (Marshal() → store in queryValues.value → copy into framer.buf → buffer is dead) is identical for all types — there is nothing vector-specific about it.

The type-specialized fast paths are also generalizable to list/set/map, with the same 15-35× speedup potential for homogeneous fixed-size collections. Go generics (already used in internal/lru and internal/eventbus) can keep the code duplication manageable.

Suggested path forward:

  • Phase 1 (generalized pool) can be part of this PR or an immediate follow-up — it's low-risk and directly addresses the review feedback
  • Phase 2 (generalized fast paths) can be a separate PR building on Phase 1
  • The vector-specific optimizations in this PR remain valid as the highest-impact instance of the pattern (1536-element vectors dwarf typical 10-100 element lists)
