perf: optimize vector marshal/unmarshal for float32/float64/int32/int64 (throughput: marshal 75 MiB/s → 1.9 GiB/s, unmarshal 106 MiB/s → 4.6 GiB/s) #770
Conversation
Pull request overview
This PR introduces high-performance, type-specialized marshal/unmarshal paths for common numeric vector element types to reduce reflect overhead and allocations in the gocql CQL codec layer.
Changes:
- Added fast paths in `marshalVector`/`unmarshalVector` for `[]float32`, `[]float64`, `[]int32`, `[]int64`, plus a `sync.Pool`-backed buffer helper for marshal-side reuse.
- Added a generic-path preallocation helper (`vectorFixedElemSize` + `buf.Grow`) to reduce `bytes.Buffer` growth for fixed-size element types.
- Added extensive unit tests for vector behavior and expanded internal/public benchmarks for new int32/int64 vector cases and pooled scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, vector buffer pool helpers, and generic vector preallocation support. |
| marshal_vector_test.go | New comprehensive unit tests for vector fast paths, pooling helpers, and preallocation behavior. |
| vector_bench_test.go | Adds/extends internal benchmarks for pooled marshal and int32/int64 vector performance. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vector marshal/unmarshal via gocql.Marshal/Unmarshal. |
Force-pushed from f6229cb to c9ea393
Pull request overview
Introduces type-specialized fast paths for vector marshal/unmarshal (float32/float64/int32/int64) in marshal.go to avoid reflect-heavy per-element encoding, plus expanded benchmarks and a new, comprehensive unit-test suite to validate correctness and performance characteristics.
Changes:
- Add fast-path vector marshal/unmarshal implementations using `encoding/binary` bulk conversion and destination-slice reuse.
- Add `sync.Pool`-backed buffer helpers (`getVectorBuf`/`putVectorBuf`) and generic-path preallocation via `vectorFixedElemSize`.
- Expand internal and public benchmarks; add a large new unit test file covering many vector edge cases.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| marshal.go | Adds specialized vector marshal/unmarshal fast paths, pooling helpers, and generic-path preallocation. |
| marshal_vector_test.go | New unit tests for vector behavior (round-trip, byte-compat, slice reuse, pool behavior, etc.). |
| vector_bench_test.go | Adds pooled/write-path/round-trip benchmarks plus int32/int64 benchmark coverage. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |
Force-pushed from c9ea393 to 79698fb
Pull request overview
This PR accelerates CQL vector encoding/decoding in the GoCQL driver by introducing type-specialized marshal/unmarshal fast paths for common numeric vector element types, reducing reflection overhead and allocations on hot paths.
Changes:
- Add specialized marshal/unmarshal implementations for `[]float32`, `[]float64`, `[]int32`, `[]int64` using `encoding/binary` + bit conversions, with unmarshal slice reuse.
- Introduce `vectorBufPool` (`sync.Pool`) helpers for reusable marshal buffers and add generic-path preallocation via `vectorFixedElemSize()` + `bytes.Buffer.Grow()`.
- Add extensive unit tests for vector behavior and expand internal + public benchmarks for the new fast paths.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, buffer pooling helpers, and generic-path preallocation helper. |
| marshal_vector_test.go | New unit test suite covering round-trip, compatibility, reuse, pool behavior, and prealloc. |
| vector_bench_test.go | Adds pooled/unpooled benchmarks and a simulated write-path benchmark for vectors. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |
Pull request overview
This PR adds specialized, non-reflect fast paths for marshaling/unmarshaling common vector<> element types in the GoCQL driver to significantly reduce allocations and improve throughput, along with extensive tests and expanded benchmarks.
Changes:
- Add type-specialized vector marshal/unmarshal implementations for `[]float32`, `[]float64`, `[]int32`, and `[]int64`, plus a pooled `[]byte` buffer facility for marshal fast paths.
- Improve generic vector marshal performance via preallocation (`buf.Grow`) when element wire size is known.
- Add a comprehensive unit test suite for vector behavior and extend internal + public benchmarks for the new int32/int64 paths and pooled scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, buffer pooling helpers, overflow guards, and generic-path preallocation. |
| marshal_vector_test.go | New comprehensive unit tests covering fast paths, edge cases, and pool behavior. |
| vector_bench_test.go | Adds pooled/write-path/round-trip benchmarks and int32/int64 benchmarks. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vectors. |
Pull request overview
This PR introduces type-specialized fast paths for marshaling/unmarshaling common fixed-width vector element types (float32/float64/int32/int64) to avoid reflect-based per-element work, reducing allocations and significantly improving throughput in the driver’s vector serialization layer.
Changes:
- Added fast-path dispatch in `marshalVector`/`unmarshalVector` with dedicated bulk encode/decode implementations for float32/float64/int32/int64 vectors.
- Introduced a `sync.Pool`-backed byte buffer reuse mechanism for vector marshaling and slice-backing reuse for unmarshaling.
- Added extensive unit tests plus expanded internal/public benchmarks for the new fast paths and pooled usage patterns.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| marshal.go | Adds fast-path vector marshal/unmarshal implementations, buffer pooling helpers, and generic-path preallocation. |
| marshal_vector_test.go | Adds a comprehensive unit test suite for vector behavior, pooling, edge cases, and compatibility. |
| vector_bench_test.go | Adds/extends internal benchmarks for pooled marshal, write-path simulation, and int vector types. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to cover int32/int64 vector marshal/unmarshal. |
Force-pushed from 9e5efc1 to 234f670
Addressed review feedback:
- Fixed (this push):
- Already fixed in prior revisions:
- Not a bug (Copilot false positives):
- Not changing (style preference):
Pull request overview
Adds type-specialized marshal/unmarshal fast paths for common vector<...> element types to significantly reduce reflection overhead and allocations in the driver’s value encoding/decoding layer.
Changes:
- Introduces specialized marshal/unmarshal implementations for `[]float32`, `[]float64`, `[]int32`, `[]int64`, plus a `sync.Pool`-backed buffer helper for marshal fast paths.
- Adds generic-path preallocation for fixed-size vector element types and improves 0-dimension handling in unmarshal.
- Expands benchmarks and adds a comprehensive new unit test suite for vector behavior/performance characteristics.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| marshal.go | Adds vector fast paths, pooled buffer helpers, fixed-element-size prealloc, and 0-dimension unmarshal handling. |
| marshal_vector_test.go | New, extensive unit tests for vector marshal/unmarshal correctness, edge cases, and pooling behavior. |
| vector_bench_test.go | Extends internal benchmarks to cover pooled write-path simulations and int32/int64 vectors. |
| tests/bench/bench_vector_public_test.go | Extends public API benchmarks to include int32/int64 vector marshal/unmarshal. |
Force-pushed from b0a2d82 to 118e06c
Force-pushed from 432f624 to 9410b42
@mykaul, it is a great idea to use pooled buffers, but I think we need to make it generic so that it works the same way for every data type; I don't see any point in targeting vectors specifically.
Add type-specialized fast paths for vector<float>, vector<double>,
vector<int>, and vector<bigint> that bypass reflect-based per-element
marshaling in favor of direct encoding/binary bulk conversion.
Changes in marshal.go:
- Type switches in marshalVector()/unmarshalVector() dispatch to
dedicated functions for []float32, []float64, []int32, []int64
before falling through to the generic reflect path.
- 8 new functions: marshalVectorFloat32, marshalVectorFloat64,
unmarshalVectorFloat32, unmarshalVectorFloat64, marshalVectorInt32,
marshalVectorInt64, unmarshalVectorInt32, unmarshalVectorInt64.
- sync.Pool buffer reuse (vectorBufPool/getVectorBuf/putVectorBuf)
for zero-alloc steady state when callers return buffers after
the framer copies them. 64KiB cap prevents pool bloat.
- Unmarshal fast paths reuse destination slice backing array when
capacity is sufficient (zero-alloc steady state on read path).
- Generic path preallocation via vectorFixedElemSize() + buf.Grow()
for non-fast-path fixed-size types (e.g. UUID, timestamp).
- vectorByteSize() helper guards against integer overflow on 32-bit
platforms with corrupt or adversarial schema metadata.
- All fast-path errors are wrapped as MarshalError/UnmarshalError
for consistent error typing.
- dim=0 vectors correctly encode as non-nil empty values (not CQL NULL)
in both fast paths and generic path.
- Negative dimensions are rejected with clear error messages.
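The bulk-conversion idea the commit message describes can be sketched in a few lines. This is a minimal, self-contained illustration of the `encoding/binary` + `math.Float32bits` encoding for a float32 vector; the function name is invented for the sketch, and the PR's pooling and error wrapping are simplified away:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// marshalVectorFloat32Sketch is an illustrative stand-in for the fast path:
// one buffer sized up front, then bulk big-endian encoding via
// math.Float32bits instead of per-element reflection.
func marshalVectorFloat32Sketch(dim int, v []float32) ([]byte, error) {
	if len(v) != dim {
		return nil, fmt.Errorf("expected %d elements, got %d", dim, len(v))
	}
	buf := make([]byte, 4*len(v)) // the PR takes this from a sync.Pool instead
	for i, f := range v {
		binary.BigEndian.PutUint32(buf[i*4:], math.Float32bits(f))
	}
	return buf, nil
}

func main() {
	out, err := marshalVectorFloat32Sketch(2, []float32{1.0, -2.5})
	fmt.Println(out, err) // prints: [63 128 0 0 192 32 0 0] <nil>
}
```

The int32/int64 and float64 variants differ only in element width and the `PutUint64`/`Float64bits` calls.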
Benchmark results for vector<float, 1536> (typical embedding dimension):
Marshal (baseline -> optimized):
86.4 us/op -> 3.4 us/op (25x faster)
3081 allocs -> 2 allocs (99.94% fewer)
28632 B/op -> 6172 B/op (78% less memory)
Marshal with pool return (steady state):
86.4 us/op -> 1.6 us/op (54x faster)
3081 allocs -> 2 allocs (99.94% fewer)
28632 B/op -> 48 B/op (99.8% less memory)
Unmarshal (baseline -> optimized):
60.2 us/op -> 1.5 us/op (41x faster)
2 allocs -> 0 allocs (100% fewer)
6168 B/op -> 0 B/op (100% less memory)
Round-trip (baseline -> optimized, pooled):
147.8 us/op -> 3.1 us/op (48x faster)
3083 allocs -> 2 allocs (99.94% fewer)
34800 B/op -> 48 B/op (99.9% less memory)
Throughput: 80 MB/s -> 3.5 GB/s (geomean, +2900%)
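The zero-alloc unmarshal figures above come from reusing the destination slice's backing array. A rough sketch of that slice-reuse idea, with a hypothetical function name and simplified validation:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// unmarshalVectorFloat32Sketch decodes a big-endian float32 vector, reusing
// dst's backing array when it already has enough capacity — so repeated
// reads into the same slice allocate nothing. Illustrative only.
func unmarshalVectorFloat32Sketch(data []byte, dst []float32) ([]float32, error) {
	if len(data)%4 != 0 {
		return nil, fmt.Errorf("data length %d not a multiple of 4", len(data))
	}
	n := len(data) / 4
	if cap(dst) >= n {
		dst = dst[:n] // reuse the existing backing array: no allocation
	} else {
		dst = make([]float32, n)
	}
	for i := range dst {
		dst[i] = math.Float32frombits(binary.BigEndian.Uint32(data[i*4:]))
	}
	return dst, nil
}

func main() {
	data := []byte{0x3F, 0x80, 0, 0, 0x40, 0x20, 0, 0} // 1.0, 2.5
	v, _ := unmarshalVectorFloat32Sketch(data, make([]float32, 0, 8))
	fmt.Println(v) // prints: [1 2.5]
}
```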
New test files:
- marshal_vector_test.go: 58+ unit subtests across 13 categories
(round-trip, byte-compat, slice-reuse, nil, dimension-mismatch,
empty-vector, pointer-to-slice, special-values, pool-concurrency,
oversized-not-pooled, fixed-elem-size, generic-prealloc).
- vector_bench_test.go: extended with int32/int64 and pooled benchmarks.
- tests/bench/bench_vector_public_test.go: public API benchmarks for
int32/int64 marshal/unmarshal.
Subsumes PR scylladb#744 (float fast paths) and PR scylladb#745 (generic prealloc).
Extends with int32/int64 fast paths and buffer pooling not covered by
any existing PR.
@dkropachev - I targeted vectors just because they are large. I can have a pool per type - or do you prefer one general pool for all types?
Force-pushed from 5e62e7c to ea4e0d7
Return pooled vector buffers to vectorBufPool after the framer copies marshalled bytes in executeQuery and executeBatch. This completes the zero-alloc steady-state cycle for vector marshal operations.

In executeQuery, a defer after the marshal loop returns buffers for columns identified as pooled vector types (float32, float64, int32, int64). In executeBatch, vector buffers are collected across all batch statements and returned via a single defer. The vectorBufPoolSubtype helper centralizes the type check to keep the two call sites consistent with the marshal fast paths.

Includes unit tests covering vectorBufPoolSubtype classification, single-query and batch pool return simulation, and non-pooled type safety.
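The lifecycle this commit describes — take a buffer from the pool, let the framer copy it, return it via defer — can be simulated in miniature. Everything below is a simplified stand-in (the real framer and call sites live in gocql's conn.go); only the 64 KiB cap guard mirrors the PR directly:

```go
package main

import (
	"fmt"
	"sync"
)

// vectorBufPool / getVectorBuf / putVectorBuf follow the PR's names, but this
// is a sketch: the real helpers live in marshal.go.
var vectorBufPool = sync.Pool{New: func() any { return make([]byte, 0, 1024) }}

func getVectorBuf(n int) []byte {
	b := vectorBufPool.Get().([]byte)
	if cap(b) < n {
		b = make([]byte, n)
	}
	return b[:n]
}

func putVectorBuf(b []byte) {
	if cap(b) > 64*1024 { // cap guard: don't let oversized buffers bloat the pool
		return
	}
	vectorBufPool.Put(b[:0])
}

// executeQuerySketch fakes the write path: the "framer" copies each value,
// and a defer returns the buffer to the pool after the copy is done.
func executeQuerySketch(values [][]byte) []byte {
	frame := []byte{}
	for _, v := range values {
		frame = append(frame, v...) // framer copies the bytes
		defer putVectorBuf(v)       // safe to return only after the copy
	}
	return frame
}

func main() {
	v := getVectorBuf(4)
	copy(v, []byte{1, 2, 3, 4})
	fmt.Println(executeQuerySketch([][]byte{v})) // prints: [1 2 3 4]
}
```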
Add dedicated marshal/unmarshal fast paths for UUID and TimeUUID vector elements, following the same pattern as the existing float32/float64/int32/int64 fast paths. UUID is [16]byte with no endian conversion needed, so the fast path uses a simple copy() loop. Uses pooled buffers via getVectorBuf for zero-alloc steady state on the marshal path, and reuses the destination slice backing array on the unmarshal path.

Benchmarks (vs generic reflection path):
- Marshal: ~90% faster (10x speedup), 99%+ fewer allocations
- Unmarshal: ~97% faster (30-35x speedup), zero allocations
- Marshal+pool: additional 4x over non-pooled marshal
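Because each element is a fixed 16 bytes with no endian conversion, the UUID encoding is just a `copy()` per element. A hedged sketch — the `UUID` type and function name here are local stand-ins, not gocql's:

```go
package main

import (
	"bytes"
	"fmt"
)

// UUID stands in for gocql's [16]byte UUID type.
type UUID [16]byte

// marshalVectorUUIDSketch shows the idea: one preallocated buffer, then a
// straight copy of each fixed-width element — no bit conversions needed.
func marshalVectorUUIDSketch(v []UUID) []byte {
	buf := make([]byte, 16*len(v)) // the PR takes this from the pool instead
	for i := range v {
		copy(buf[i*16:], v[i][:])
	}
	return buf
}

func main() {
	u := UUID{0: 0xAB, 15: 0xCD}
	out := marshalVectorUUIDSketch([]UUID{u, u})
	fmt.Println(len(out), bytes.Equal(out[:16], out[16:])) // prints: 32 true
}
```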
…arse

VectorType embeds NativeType but had no NewWithError() method, so calls fell through to NativeType.NewWithError(), which hit the TypeCustom fallback: goType() → asVectorType() → re-parse the full Java type string (e.g. 'org.apache.cassandra.db.marshal.VectorType(FloatType, 1536)') on every invocation. This is called per-column per-row by RowData() and MapScan, making it a hot path for vector workloads.

Add VectorType.NewWithError() with fast paths for all common element types (float32, float64, int32, int64, UUID, string, bool, etc.) that return *[]T directly without reflection or string parsing. The fallback for exotic subtypes still uses SubType.NewWithError() + reflect.SliceOf but avoids the asVectorType() re-parse.

Also fix zero-dimension error messages in fast-path unmarshal functions to be consistent with the generic path (check dim==0 before byte-size validation), fix the copyright header in marshal_vector_test.go, and fix a pre-existing session_unit_test.go build error from origin/master (hostId string → UUID type mismatch).

Benchmark results (VectorType.NewWithError vs NativeType fallback):
- VectorType: ~17 ns/op, 24 B/op, 1 alloc/op
- NativeType_fallback: ~170 ns/op, 92 B/op, 4 allocs/op
- → 10x faster, 75% fewer allocations
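The idea behind the new method — hand back a pointer to a concrete slice type from a switch instead of re-parsing the type string and calling reflect.SliceOf — can be illustrated with a toy dispatcher. The element-kind strings and fallback error below are invented for the sketch; gocql switches on its own type constants:

```go
package main

import "fmt"

// newVectorValueSketch returns a *[]T for common element kinds without any
// reflection or string re-parsing. Illustrative only: the kind names and the
// error in the default branch are not gocql's API.
func newVectorValueSketch(elemKind string) (any, error) {
	switch elemKind {
	case "float":
		return new([]float32), nil
	case "double":
		return new([]float64), nil
	case "int":
		return new([]int32), nil
	case "bigint":
		return new([]int64), nil
	default:
		return nil, fmt.Errorf("no fast path for element kind %q", elemKind)
	}
}

func main() {
	v, err := newVectorValueSketch("float")
	fmt.Printf("%T %v\n", v, err) // prints: *[]float32 <nil>
}
```

In the real change, the default branch falls back to SubType.NewWithError() plus reflect.SliceOf rather than returning an error.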
Force-pushed from 81dae0b to 66ed9aa
Analysis: Generalizing the Pooled-Buffer / Fast-Path Concept

Following up on @dkropachev's comment:
Here is a detailed analysis of whether, how, and to what extent the two core optimizations in this PR can be generalized.

Two Separable Concepts

The PR contains two distinct optimizations that should be evaluated separately:
Why Generalization is Justified

Every collection marshal function (
For a
Value Analysis

Concept A (Buffer Pool) — Impact by Type
Concept B (Fast Paths) — Projected Speedup

Based on this PR's measured vector results and architectural similarity:
Risk Assessment
Complexity & Implementation Plan

Recommended: two-phase approach.

Phase 1: Generalized Buffer Pool (Low Risk, High Value) — ~100-150 LOC
Phase 2: Generalized Fast Paths with Go Generics (Medium Risk, High Value) — ~300-400 LOC
What should NOT be generalized:
Test Coverage Requirements
Conclusion

@dkropachev is right that the buffer pooling concept should be generalized. The marshal lifecycle (
The type-specialized fast paths are also generalizable to
Suggested path forward:
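As a thought experiment for the Phase 1 proposal above, a generalized, size-bucketed marshal buffer pool shared by all types might look like this. The names and the power-of-two bucketing policy are entirely hypothetical, not code from this PR:

```go
package main

import (
	"fmt"
	"sync"
)

// One pool per power-of-two capacity bucket (1 B .. 64 KiB), shared by every
// marshal path instead of being vector-specific. Hypothetical sketch.
var marshalBufPools [17]sync.Pool

// bucketFor returns the smallest b with 1<<b >= n.
func bucketFor(n int) int {
	b := 0
	for 1<<b < n {
		b++
	}
	return b
}

func getMarshalBuf(n int) []byte {
	if n > 1<<16 {
		return make([]byte, n) // too big to pool, mirror the PR's cap guard
	}
	b := bucketFor(n)
	if v := marshalBufPools[b].Get(); v != nil {
		return v.([]byte)[:n] // pooled buffer always has cap 1<<b >= n
	}
	return make([]byte, n, 1<<b)
}

func putMarshalBuf(buf []byte) {
	// Only pool buffers whose capacity is a bucket size we handed out.
	if c := cap(buf); c != 0 && c <= 1<<16 && c&(c-1) == 0 {
		marshalBufPools[bucketFor(c)].Put(buf[:0])
	}
}

func main() {
	b := getMarshalBuf(100) // served from the 128-byte bucket
	fmt.Println(len(b), cap(b)) // prints: 100 128
	putMarshalBuf(b)
}
```

Bucketing keeps reuse high across the mixed sizes that lists, maps, and vectors produce, at the cost of up to 2x over-allocation per buffer.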
Summary
Type-specialized fast paths for `vector<float>`, `vector<double>`, `vector<int>`, `vector<bigint>`, and `vector<uuid>`/`vector<timeuuid>` that bypass reflect-based per-element marshaling in favor of direct `encoding/binary` bulk conversion, plus `sync.Pool` buffer reuse wired into the connection write path, and a `VectorType.NewWithError()` fast path that eliminates the expensive `goType()` → `asVectorType()` re-parse on every call.

Commit 1:
d527db1 — perf: optimize vector marshal/unmarshal for float32/float64/int32/int64

Fast-path type switches, 8 dedicated marshal/unmarshal functions, `sync.Pool` infrastructure (`getVectorBuf`/`putVectorBuf`), unmarshal slice reuse, generic-path `buf.Grow()` preallocation via `vectorFixedElemSize()`, and comprehensive tests (58 subtests across 13 categories).

Commit 2:
04f0783 — perf: wire putVectorBuf into connection write path

Adds `defer putVectorBuf(...)` calls in `executeQuery()` and `executeBatch()` in `conn.go`, so production callers return pooled marshal buffers after the framer copies them. This closes the pool lifecycle and achieves 48 B/op steady-state on the write path.

Commit 3:
d48f44f — perf: add UUID/TimeUUID vector fast path

Adds marshal/unmarshal fast paths for `vector<uuid>` and `vector<timeuuid>` — bulk `copy()` of fixed 16-byte elements with zero per-element allocations. UUID vectors are common in similarity search use cases (storing document IDs alongside embeddings).

Commit 4:
66ed9aa — perf: add VectorType.NewWithError() to avoid goType/asVectorType re-parse

`VectorType` embeds `NativeType` but had no `NewWithError()` method. When called, it dispatched to `NativeType.NewWithError()`, which hit `TypeCustom` → `goType()` → `asVectorType()`, re-parsing the full Java type string on every call. The new method returns `*[]SubType` directly: 10.7x faster (181ns → 17ns), 75% fewer allocs (4 → 1), 74% less memory (92B → 24B).

Headline numbers (
`vector<float, 1536>`, typical embedding dimension): `NewWithError()` for `VectorType` (181ns → 17ns)

Benchmark results
All benchmarks: 6 iterations,
benchstat, all p=0.002. Machine: 12th Gen Intel Core i7-1270P.Master =
3881f1e(origin/master), Optimized =66ed9aa(this branch HEAD).Latency (ns/op) — Master vs Optimized (all 4 commits)
Pool wiring benefit (Commit 2) — production write path
The table above measures
Marshal()/Unmarshal()via the public API, which does not return buffers to the pool. In production,executeQuery()/executeBatch()return the buffer viaputVectorBuf()after the framer copies it. ThePooledbenchmarks simulate this:Memory with pool return is 48 B/op constant, regardless of vector dimension or element type (from
sync.Poolinterface boxing overhead, irreducible). Compare to master: 18,456 B/op for float32/dim_1536, 98,328 B/op for UUID/dim_1536.Full benchstat details: memory and allocations (click to expand)
Memory (B/op)
Allocations (allocs/op)
Pooled marshal memory (B/op) — with pool return
What changed
marshal.goFast-path type switches in
marshalVector()andunmarshalVector()— before the existing reflect-based generic path, a switch oninfo.SubType.Type()intercepts[]float32,[]float64,[]int32,[]int64,[]UUIDand dispatches to 10 dedicated functions. Falls through to the generic path for all other types.10 new marshal/unmarshal functions —
marshalVectorFloat32/Float64/Int32/Int64/UUIDand correspondingunmarshalVector*. Float/int functions useencoding/binary.BigEndian.PutUint32/PutUint64withmath.Float32bits/Float64bits. UUID functions use bulkcopy()of 16-byte elements.sync.Poolbuffer reuse —vectorBufPool,getVectorBuf(size),putVectorBuf(buf)with 64 KiB cap guard.Unmarshal slice reuse — All unmarshal fast paths reuse the destination slice's backing array when capacity is sufficient, achieving zero allocations on repeated reads.
Generic path preallocation —
vectorFixedElemSize()returns the wire-format byte size for fixed-length CQL types. The generic path callsbuf.Grow()upfront.VectorType.NewWithError()— Returns*[]SubTypedirectly without going throughgoType()→asVectorType(). 10.7x faster, 75% fewer allocs.conn.goexecuteQuery()andexecuteBatch()calldefer putVectorBuf(...)on eachqueryValues.value, closing the pool lifecycle.Test files
marshal_vector_test.go— 58 unit subtests across 13 categories including UUID-specific testsvector_bench_test.go— Benchmarks for all 5 types: marshal, pooled marshal, unmarshal, across dimensions 128/384/768/1536marshal_test.go—TestVectorNewWithErrorConsistentWithGoType,TestVectorNewWithErrorReturnsSlicePointerhelpers_bench_test.go—BenchmarkVectorNewWithError,BenchmarkRowDataWithVectorHow the bottleneck was eliminated
The original generic path for
vector<float, 1536>:Marshal()1,536 times through reflect dispatch[]byteviaencFloat32(1,536 allocs)bytes.Bufferthat grew incrementally (additional allocs)framer.bufbywriteBytes()The fast path:
getVectorBuf(6144)(pooled, zero-alloc steady state)binary.BigEndian.PutUint32(buf[i*4:], math.Float32bits(v))putVectorBuf()returns the buffer to the pool afterc.exec()Design decisions
Writing directly into `framer.buf` would save ~0.4 µs (~21% marginal over the pooled path) but requires invasive changes to the `queryValues` struct used by all query paths. The risk/reward ratio is unfavorable.

Variable-length element types (`isVectorVariableLengthType`) were deliberately skipped — there is a discrepancy in how some types are handled between Cassandra and ScyllaDB implementations. We focus only on the 5 types where the wire format is unambiguous.

Relationship to open PRs
Replaces PR #744 (float fast paths) and PR #745 (generic prealloc)
This PR is a strict superset of both:
- `sync.Pool` buffer reuse
- `conn.go` wiring
- `VectorType.NewWithError()`
- `vectorFixedElemSize()` helper
- `buf.Grow()` prealloc

If this PR merges first, #744 and #745 become no-ops and should be closed.
Orthogonal to PRs #751, #752, #753
These PRs reduce per-request allocations in the connection/framing layer. This PR reduces allocations in the marshal/unmarshal layer. They are fully complementary — different files, different allocation sites, additive benefits.
Note: PR #749 (pool write-side framers) was closed — fully superseded by
3e1e7e4 on master.

Depends on PR #838
PR #838 fixes a pre-existing build failure in
`session_unit_test.go` where `hostId` changed from `string` to `UUID` but test literals were not updated. This branch carries the same fix; once #838 merges, the fix becomes a no-op on rebase.