perf: generalize buffer pool for collection/vector marshal#839
Draft
mykaul wants to merge 4 commits into scylladb:master from
Replace per-call &bytes.Buffer{} allocations in marshalList, marshalMap,
and marshalVector with a shared sync.Pool (marshalBufPool).
Key changes:
- Add marshalBufPool with getMarshalBuf/putMarshalBuf/finishMarshalBuf
helper functions for safe pool lifecycle management
- finishMarshalBuf copies data out before returning the buffer to the
pool, eliminating any risk of aliased-slice corruption
- Add pre-sizing via buf.Grow() in marshalList and marshalMap using
fixedElemSize() to estimate buffer capacity for fixed-width CQL types
- Rename vectorFixedElemSize -> fixedElemSize and add missing types:
TinyInt (1B), SmallInt (2B), Date (4B), Counter (8B), Time (8B)
- Discard oversized buffers (>64KiB) to prevent pool memory bloat
- All error paths properly return buffers to the pool
The pool eliminates repeated Buffer struct allocation (96 bytes) and
internal slice re-growth in steady state. The copy in finishMarshalBuf
is equivalent to what the old buf.Bytes() caller would have needed
anyway, so the net overhead is negligible (+0.06% to +0.16% B/op).
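The lifecycle described above can be sketched as follows. The helper names (`getMarshalBuf`, `putMarshalBuf`, `finishMarshalBuf`) come from the commit; the constant name `maxPooledBufSize` and the exact signatures are assumptions for illustration.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// maxPooledBufSize mirrors the 64 KiB discard threshold described above
// (constant name hypothetical).
const maxPooledBufSize = 64 * 1024

var marshalBufPool = sync.Pool{
	New: func() any { return &bytes.Buffer{} },
}

func getMarshalBuf() *bytes.Buffer {
	return marshalBufPool.Get().(*bytes.Buffer)
}

func putMarshalBuf(buf *bytes.Buffer) {
	// Drop oversized buffers so one huge value cannot pin memory in the pool.
	if buf.Cap() > maxPooledBufSize {
		return
	}
	buf.Reset()
	marshalBufPool.Put(buf)
}

// finishMarshalBuf copies the accumulated bytes into a fresh slice before
// recycling the buffer, so callers never alias pooled memory.
func finishMarshalBuf(buf *bytes.Buffer) []byte {
	out := make([]byte, buf.Len())
	copy(out, buf.Bytes())
	putMarshalBuf(buf)
	return out
}

func main() {
	b := getMarshalBuf()
	b.WriteString("payload")
	data := finishMarshalBuf(b)
	fmt.Println(string(data), len(data))
}
```

The copy in `finishMarshalBuf` is exactly the copy the old `buf.Bytes()` caller would have needed, which is why the B/op delta stays within the quoted +0.06% to +0.16%.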
Benchmark results (6 iterations, benchstat):
Marshal allocs/op: identical (no new allocations)
Marshal B/op: +0.06% to +0.16% (copy overhead)
Unmarshal path: completely unchanged
Includes 20 new unit tests covering pool infrastructure, fixedElemSize,
round-trip correctness (list/map/vector with various types including
edge cases), byte-level wire format compatibility, and concurrent
safety under the race detector.
Add direct encoding/binary fast paths for marshalVector and unmarshalVector that bypass per-element reflection and Marshal/Unmarshal calls for common fixed-size CQL types: float32, float64, int32, int64, and UUID.

Key changes:
- marshalVectorFloat32/Float64/Int32/Int64/UUID: write directly to []byte using binary.BigEndian, with BCE hints for bounds elimination
- unmarshalVectorFloat32/Float64/Int32/Int64/UUID: read directly from []byte, reusing the destination slice backing array when cap >= dim
- Type-switch dispatch in marshalVector/unmarshalVector tries fast paths before falling through to the existing reflection-based slow path
- VectorType.NewWithError() avoids the expensive goType→asVectorType re-parse for common element types
- vectorByteSize() overflow check helper
- Comprehensive tests: round-trip, special values, nil/zero-dim, slice reuse, dimension mismatch, wire format verification, fallthrough
- Benchmarks for all fast-path types

Performance (768-dim float32 vector, representative):
- Marshal: 22,500 → 1,090 ns/op (~21x faster), 1538 → 2 allocs
- Unmarshal: 17,700 → 650 ns/op (~27x faster), 2 → 0 allocs
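A minimal sketch of the float32 marshal fast path described above. The function name `marshalVectorFloat32` appears in the commit; the exact signature is an assumption, and the real code additionally pools the output buffer.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// marshalVectorFloat32 writes a float32 vector as consecutive 4-byte
// big-endian IEEE-754 values, with no per-element reflection or
// interface boxing. (Illustrative sketch; signature assumed.)
func marshalVectorFloat32(v []float32) []byte {
	out := make([]byte, 4*len(v))
	for i, f := range v {
		// Slicing at out[i*4:] helps the compiler eliminate bounds checks
		// inside PutUint32 (the "BCE hints" mentioned in the commit).
		binary.BigEndian.PutUint32(out[i*4:], math.Float32bits(f))
	}
	return out
}

func main() {
	data := marshalVectorFloat32([]float32{1.0, -2.5})
	fmt.Printf("% x\n", data)
}
```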
Add direct encoding/binary fast paths for marshalList/unmarshalList with []int32, []int64, []float32, []float64 element types, bypassing per-element reflection and the generic Marshal()/Unmarshal() calls.

List wire format: [4-byte count] + N × ([4-byte elem-length] + [elem-bytes])

Marshal fast paths: single allocation for the output []byte, BCE-friendly inner loops writing length prefix + element data directly.

Unmarshal fast paths: element-by-element parsing with signed length interpretation (matching readCollectionSize), null element support (negative length → zero value), and slice reuse when capacity suffices.

Coverage: TypeFloat, TypeDouble, TypeInt, TypeBigInt, TypeTimestamp, TypeCounter — all types whose wire representation is a fixed-size big-endian encoding.

33 new unit tests in list_fastpath_test.go:
- Round-trip correctness for all 4 types
- Null element handling (negative length prefix → zero value)
- Empty/nil list handling
- Slice capacity reuse
- Wire format byte-level compatibility
- Cross-path compatibility (fast marshal + reflect unmarshal)
- Boundary values, special floats (NaN, Inf)
- Set type coverage (same code path)
- TypeCounter and TypeTimestamp round-trips
- Overflow, truncated data, negative count error cases

8 new benchmarks (marshal + unmarshal × 4 types × 3 sizes).
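The single-pass list marshal described above can be sketched for the `[]int32` case. The function name `marshalListInt32` is hypothetical (the commit names the family only), and the `MaxInt32` overflow guard is omitted for brevity.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// marshalListInt32 emits the CQL list wire format from the commit message:
// [4-byte count] + N x ([4-byte elem-length] + [elem-bytes]),
// in a single allocation. (Sketch; name and overflow guard assumed.)
func marshalListInt32(v []int32) []byte {
	out := make([]byte, 4+len(v)*(4+4))
	binary.BigEndian.PutUint32(out, uint32(len(v)))
	p := 4
	for _, e := range v {
		binary.BigEndian.PutUint32(out[p:], 4) // per-element length prefix
		binary.BigEndian.PutUint32(out[p+4:], uint32(e))
		p += 8
	}
	return out
}

func main() {
	fmt.Printf("% x\n", marshalListInt32([]int32{7}))
}
```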
Add marshalOutputPool (sync.Pool) to recycle []byte slices returned by the 10 type-specialized marshal functions (vectors and lists/sets). The connection layer (executeQuery, executeBatch) returns these buffers to the pool after the framer copies them via writeBytes.

Key changes:
- getMarshalOutput/putMarshalOutput: pool management with a capacity guard
- pooledMarshalType: identifies types using the pooled marshal fast paths
- executeQuery: scans columns for poolable types, installs the defer before the marshal loop so buffers are returned even on mid-loop errors
- executeBatch: unconditional defer with a pooledBufs collection, also handles error-path cleanup correctly
- All 10 fast-path marshal functions use getMarshalOutput instead of make([]byte, size)
- 9 new unit tests covering pool mechanics, round-trip reuse, and pooledMarshalType with 25 type coverage subcases
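A minimal sketch of the output pool, assuming the `getMarshalOutput`/`putMarshalOutput` names from the commit; pooling `*[]byte` rather than `[]byte` avoids an allocation on every `Put`, and the 64 KiB cap guard mirrors the buffer pool. The exact internals are assumptions.

```go
package main

import (
	"fmt"
	"sync"
)

// maxPooledOutputSize mirrors the cap guard described above (name assumed).
const maxPooledOutputSize = 64 * 1024

var marshalOutputPool sync.Pool // stores *[]byte to keep Put allocation-free

// getMarshalOutput returns a []byte of exactly the requested length,
// reusing a pooled backing array when one with enough capacity exists.
func getMarshalOutput(size int) []byte {
	if p, ok := marshalOutputPool.Get().(*[]byte); ok && cap(*p) >= size {
		return (*p)[:size]
	}
	return make([]byte, size)
}

// putMarshalOutput recycles a buffer once the framer has copied it;
// nil and oversized buffers are discarded.
func putMarshalOutput(buf []byte) {
	if buf == nil || cap(buf) > maxPooledOutputSize {
		return
	}
	marshalOutputPool.Put(&buf)
}

func main() {
	a := getMarshalOutput(16)
	putMarshalOutput(a)
	b := getMarshalOutput(8) // may reuse a's backing array
	fmt.Println(len(b), cap(b) >= 8)
}
```

Safe pooling here depends entirely on `writeBytes` copying the data into the framer before the buffer is returned, which is why the defers are installed in the connection layer rather than inside the marshal functions.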
Summary
This PR delivers four complementary performance optimizations for CQL collection/vector marshal/unmarshal:
- Replaces per-call `&bytes.Buffer{}` allocations in `marshalList`, `marshalMap`, and `marshalVector` with a shared `sync.Pool`, with pre-sizing for fixed-width element types
- Direct `encoding/binary` marshal/unmarshal for `[]float32`, `[]float64`, `[]int32`, `[]int64`, `[]UUID`, and `[]int64` (counter) vectors, bypassing the per-element reflect loop entirely
- Direct `encoding/binary` marshal/unmarshal for `[]int32`, `[]int64`, `[]float32`, `[]float64` lists and sets, bypassing per-element reflect + Marshal/Unmarshal
- A pool for the `[]byte` slices returned by all 10 fast-path marshal functions. The connection layer (`executeQuery`/`executeBatch`) returns these buffers to the pool after `writeBytes` copies them into the framer, achieving near-zero steady-state allocation for hot paths.

This PR is a full superset of #770 — it includes all vector-specific optimizations from that PR plus the generalized buffer pool, list/set fast paths, and marshal output pooling.
Commit 1: Fix pre-existing build failure
Fixes `session_unit_test.go`, where 5 `hostId` fields used string literals after `hostId` was changed from `string` to `UUID` in d93b010.

Commit 2: Generalized buffer pool
Design
Pool lifecycle (no aliasing risk):
`finishMarshalBuf` copies the data out before returning the buffer to the pool. Oversized buffers (>64 KiB) are discarded to prevent pool memory bloat.

Pre-sizing for collections with fixed-width elements:

- List: `4 + n * (4 + elemSize)`
- Map: `4 + n * (4 + keySize + 4 + valSize)` (only when both key and value are fixed-width)

`fixedElemSize` coverage:

Note: Boolean/TinyInt (1B) and SmallInt (2B) are intentionally excluded — Cassandra's vector implementation treats them as variable-length.
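The pre-sizing logic can be sketched as below. `fixedElemSize` is the function named above; the `Type` constants are simplified stand-ins for gocql's, and the exact signature is an assumption.

```go
package main

import (
	"bytes"
	"fmt"
)

// Type is a simplified stand-in for gocql's CQL type enum, sketch only.
type Type int

const (
	TypeInt Type = iota
	TypeBigInt
	TypeFloat
	TypeDouble
	TypeUUID
	TypeVarchar
)

// fixedElemSize reports the wire size of fixed-width CQL types;
// ok is false for variable-length types, which simply skip pre-sizing.
func fixedElemSize(t Type) (size int, ok bool) {
	switch t {
	case TypeInt, TypeFloat:
		return 4, true
	case TypeBigInt, TypeDouble:
		return 8, true
	case TypeUUID:
		return 16, true
	}
	return 0, false // e.g. TypeVarchar: variable-length
}

func main() {
	var buf bytes.Buffer
	n := 100 // element count
	if sz, ok := fixedElemSize(TypeBigInt); ok {
		// List formula: 4-byte count + n * (4-byte length prefix + element)
		buf.Grow(4 + n*(4+sz))
	}
	fmt.Println(buf.Cap() >= 4+n*12)
}
```

Because `Grow` is only a capacity hint, a wrong estimate costs at most one re-growth; it never affects correctness.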
Phase 1 Testing
`marshal_buf_pool_test.go`: pool infrastructure, `fixedElemSize` coverage, round-trip correctness for list/map/vector, wire format compatibility, concurrent safety (100 goroutines × 1000 iterations with `-race`)

Phase 1 Benchmark Results
In isolated single-goroutine benchmarks, the pool shows no measurable latency difference — the Go allocator's fast path handles `&bytes.Buffer{}` efficiently without contention. The pool benefit manifests under real workloads with concurrent marshal calls (reduced GC pressure, eliminated buffer re-growth).

Commit 3: Type-specialized vector fast paths
Design
Type-switch dispatch in `marshalVector`/`unmarshalVector` for 6 common CQL types, using `encoding/binary.BigEndian` directly:

- `marshalVectorFloat32` / `unmarshalVectorFloat32`
- `marshalVectorFloat64` / `unmarshalVectorFloat64`
- `marshalVectorInt32` / `unmarshalVectorInt32`
- `marshalVectorInt64` / `unmarshalVectorInt64` (also used by TypeTimestamp)
- `marshalVectorUUID` / `unmarshalVectorUUID` (also used by TypeTimeUUID)
- `marshalVectorCounter` / `unmarshalVectorCounter` (length-prefixed wire format)
Key properties:

- Marshal: single allocation for the output `[]byte`, BCE-hinted inner loops, no per-element `marshalData` calls
- Unmarshal: reuses the destination slice's backing array when capacity suffices (`(*p)[:dim]`)
- `vectorByteSize()` checks for integer overflow before allocating
- Counter: goes through the `isVectorVariableLengthType()` guard because counters use a length-prefixed wire format (`uVInt(8)` + 8-byte big-endian payload) but can still bypass reflection
- `VectorType.NewWithError()`: returns concrete Go types (`*[]float32`, `*[]time.Time`, etc.) instead of falling back to reflection, consistent with `goType()`

Vector Benchmark Results
`-benchtime=3s -count=5 -cpu=4`, median of 5 runs. Reflect-path baseline measured using named types that bypass the fast-path type-switch.

Float32 vectors (dim_768)
All vector types (dim_768, before → after)
All vector types (dim_768, memory: before → after)
All unmarshal fast paths achieve zero allocations by reusing the destination slice when capacity suffices.
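The zero-allocation reuse works roughly as sketched below; `unmarshalVectorFloat32` is named in the commit, while the exact signature and trimmed error handling are assumptions.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// unmarshalVectorFloat32 decodes dim consecutive 4-byte big-endian floats,
// reusing the destination slice's backing array when its capacity suffices
// (the (*p)[:dim] reuse described above). Sketch; error paths trimmed.
func unmarshalVectorFloat32(data []byte, dim int, p *[]float32) error {
	if len(data) != 4*dim {
		return fmt.Errorf("expected %d bytes, got %d", 4*dim, len(data))
	}
	if cap(*p) >= dim {
		*p = (*p)[:dim] // reuse: zero allocations on this path
	} else {
		*p = make([]float32, dim)
	}
	for i := range *p {
		(*p)[i] = math.Float32frombits(binary.BigEndian.Uint32(data[i*4:]))
	}
	return nil
}

func main() {
	dst := make([]float32, 0, 2) // capacity 2: triggers the reuse branch
	data := []byte{0x3f, 0x80, 0x00, 0x00, 0x40, 0x00, 0x00, 0x00}
	_ = unmarshalVectorFloat32(data, 2, &dst)
	fmt.Println(dst)
}
```

This is why repeated `Scan` calls into the same destination slice show 0 allocs in the benchmarks: after the first call the capacity is sufficient forever.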
Vector Testing
`vector_fastpath_test.go`: round-trip correctness for all 6 types (including counter), boundary values (NaN, Inf, MaxFloat, min/max int), empty/nil vectors, zero-dimension handling, wrong-type fallback to the reflect path, wire format byte-level verification, counter-specific tests (wire format, wrong element length, trailing bytes, slice reuse); all pass with `-race -count=1`

Commit 4: Type-specialized list/set fast paths
Design
Type-switch dispatch in `marshalList`/`unmarshalList` for 4 common fixed-size CQL types, using `encoding/binary.BigEndian` directly. Lists and sets use the same wire format and code path.

List wire format:
`[4-byte count] + N × ([4-byte elem-length] + [elem-bytes])`

Supported types:
`TypeFloat` (`[]float32`), `TypeDouble` (`[]float64`), `TypeInt` (`[]int32`), `TypeBigInt`/`TypeTimestamp`/`TypeCounter` (`[]int64`)

Key properties:
- Marshal: single allocation for the output `[]byte`, writes count header + per-element length prefix + data in one pass, `MaxInt32` overflow guard
- Unmarshal: signed length interpretation (matching `readCollectionSize`), null element support (negative length → zero value), slice reuse when capacity suffices, per-element bounds checking
- `listByteSize()` checks for integer overflow before allocating

List Benchmark Results
-benchtime=3s -count=5 -cpu=4, median of 5 runs, n=1000 elements. Reflect-path baseline measured using named types that bypass the fast-path type-switch.Int32 lists (n=1000)
All list types (n=1000, before → after)
All list types (n=1000, memory: before → after)
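The signed-length parsing and null handling from the Key properties above can be sketched for the `[]int32` case. The function name `unmarshalListInt32` is hypothetical, and most error checks are trimmed for brevity.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// unmarshalListInt32 parses [4-byte count] + N x ([4-byte length] + bytes).
// The element length is interpreted as signed (matching readCollectionSize):
// a negative length marks a null element, which decodes to the zero value.
// Sketch only; bounds/overflow checks trimmed.
func unmarshalListInt32(data []byte) ([]int32, error) {
	if len(data) < 4 {
		return nil, fmt.Errorf("truncated list header")
	}
	n := int(int32(binary.BigEndian.Uint32(data)))
	p := 4
	out := make([]int32, 0, n)
	for i := 0; i < n; i++ {
		elemLen := int(int32(binary.BigEndian.Uint32(data[p:])))
		p += 4
		if elemLen < 0 {
			out = append(out, 0) // null element -> zero value
			continue
		}
		out = append(out, int32(binary.BigEndian.Uint32(data[p:])))
		p += elemLen
	}
	return out, nil
}

func main() {
	wire := []byte{
		0, 0, 0, 2, // count = 2
		0, 0, 0, 4, 0, 0, 0, 9, // element: length 4, value 9
		0xff, 0xff, 0xff, 0xff, // null element: length -1
	}
	v, _ := unmarshalListInt32(wire)
	fmt.Println(v)
}
```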
List Testing
`list_fastpath_test.go`: round-trip correctness for all 4 types, null element handling (negative length prefix → zero value), empty/nil lists, slice capacity reuse, wire format byte-level compatibility, cross-path compatibility (fast marshal + reflect unmarshal), boundary values, special floats, set type coverage, TypeCounter and TypeTimestamp round-trips, overflow/truncated data/negative count error cases; all pass with `-race -count=1`

Commit 5: Marshal output pool (`marshalOutputPool`)

Design
Pools the `[]byte` slices returned by all 10 type-specialized fast-path marshal functions. After the framer copies data via `writeBytes` (which does `f.buf = append(f.buf, p...)`), the connection layer returns these buffers to the pool.

Data flow:
Pool infrastructure:
- `marshalOutputPool` (`sync.Pool`): pools `[]byte` slices
- `getMarshalOutput(size)`: returns a `[]byte` of exactly the requested length, from the pool if a buffer with sufficient capacity exists, otherwise freshly allocated
- `putMarshalOutput(buf)`: returns the buffer to the pool; discards nil and oversized (>64 KiB) buffers
- `pooledMarshalType(TypeInfo)`: predicate identifying types that use pooled fast paths

Connection layer wiring:
- `executeQuery`: scans columns for poolable types and installs the defer before the marshal loop, so buffers are returned even on mid-loop errors
- `executeBatch`: a `pooledBufs [][]byte` collection; each marshaled value is checked with `pooledMarshalType` and appended to the collection. Handles multi-statement batches with mixed types.

Pooled type coverage:
- Vectors: `float`, `double`, `int`, `bigint`, `timestamp`, `counter`, `uuid`, `timeuuid`
- Lists/sets: `float`, `double`, `int`, `bigint`, `timestamp`, `counter`

Steady-state allocation: after pool warm-up, fast-path marshal calls reuse existing buffers → 0 allocs for same-sized repeated queries (the common case for prepared statements).
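The coverage predicate can be sketched as below. `pooledMarshalType` is named in the PR; the `TypeInfo` struct and `Type` constants here are simplified stand-ins for gocql's, used only to make the sketch self-contained.

```go
package main

import "fmt"

// Type and TypeInfo are simplified stand-ins for gocql's type metadata.
type Type int

const (
	TypeFloat Type = iota
	TypeDouble
	TypeInt
	TypeBigInt
	TypeTimestamp
	TypeCounter
	TypeUUID
	TypeTimeUUID
	TypeVarchar
	TypeList
	TypeVector
)

type TypeInfo struct {
	T    Type
	Elem *TypeInfo // element type for list/set/vector
}

// pooledMarshalType reports whether a column type is served by one of the
// pooled fast-path marshal functions, so the connection layer knows which
// marshaled buffers it may return to the pool after writeBytes copies them.
func pooledMarshalType(t TypeInfo) bool {
	switch t.T {
	case TypeVector:
		switch t.Elem.T {
		case TypeFloat, TypeDouble, TypeInt, TypeBigInt,
			TypeTimestamp, TypeCounter, TypeUUID, TypeTimeUUID:
			return true
		}
	case TypeList: // sets share the list code path
		switch t.Elem.T {
		case TypeFloat, TypeDouble, TypeInt, TypeBigInt,
			TypeTimestamp, TypeCounter:
			return true
		}
	}
	return false
}

func main() {
	v := TypeInfo{T: TypeVector, Elem: &TypeInfo{T: TypeUUID}}
	l := TypeInfo{T: TypeList, Elem: &TypeInfo{T: TypeVarchar}}
	fmt.Println(pooledMarshalType(v), pooledMarshalType(l))
}
```

The predicate must stay in lockstep with the fast-path dispatch: pooling a buffer produced by the reflect path (which callers may retain) would cause aliasing corruption.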
Output Pool Testing
`marshal_buf_pool_test.go`: fresh allocation, pool reuse, too-small pool buffer fallback, nil/oversized put safety, round-trip reuse, vector/list marshal pool integration, `pooledMarshalType` with 25 subcases covering all vector subtypes, list/set elem types, and negative cases (map, varchar, blob, boolean, native types); all pass with `-race -count=1`

Additional fixes
- `VectorType.NewWithError()` timestamp regression: was returning `*[]int64`, but the canonical `goType()` mapping is `time.Time`, so vector columns would have returned wrong Go types through `Iter.RowData()`. Fixed to return `*[]time.Time`.
- Added `TestVectorNewWithErrorConsistentWithGoType` in `marshal_test.go` to guard against future `NewWithError`/`goType` mismatches.

Relation to PR #770
This PR is a full superset of #770. It includes all of #770's vector-specific optimizations (marshal/unmarshal fast paths, `NewWithError` fast path) plus the generalized buffer pool for lists/maps, list/set fast paths, and marshal output pooling in the connection layer. Once this PR is merged, #770 can be closed.