perf: generalize buffer pool for collection/vector marshal#839

Draft
mykaul wants to merge 4 commits into scylladb:master from mykaul:generalize-marshal-buf-pool

Conversation


@mykaul commented Apr 10, 2026

Summary

This PR delivers four complementary performance optimizations for CQL collection/vector marshal/unmarshal:

  1. Generalized buffer pool (Commit 2): Replace per-call &bytes.Buffer{} allocations in marshalList, marshalMap, and marshalVector with a shared sync.Pool, adding pre-sizing for fixed-width element types
  2. Type-specialized vector fast paths (Commit 3): Add direct encoding/binary marshal/unmarshal for []float32, []float64, []int32, []int64, []UUID, and []int64 (counter) vectors, bypassing the per-element reflect loop entirely
  3. Type-specialized list/set fast paths (Commit 4): Add direct encoding/binary marshal/unmarshal for []int32, []int64, []float32, []float64 lists and sets, bypassing per-element reflect + Marshal/Unmarshal
  4. Marshal output pool (Commit 5): Pool the []byte slices returned by all 10 fast-path marshal functions. The connection layer (executeQuery/executeBatch) returns these buffers to the pool after writeBytes copies them into the framer, achieving near-zero steady-state allocation for hot paths.

This PR is a full superset of #770 — it includes all vector-specific optimizations from that PR plus the generalized buffer pool, list/set fast paths, and marshal output pooling.

Commit 1: Fix pre-existing build failure

Fixes session_unit_test.go, where 5 hostId fields still used string literals after hostId was changed from string to UUID in d93b010.

Commit 2: Generalized buffer pool

Design

Pool lifecycle (no aliasing risk):

getMarshalBuf(sizeHint) → pooled *bytes.Buffer, reset + pre-grown
  ↓ write serialized data
finishMarshalBuf(buf)   → make([]byte, len) + copy; putMarshalBuf(buf)
  ↓ return owned []byte
caller gets a slice that does NOT alias pooled storage

finishMarshalBuf copies the data out before returning the buffer to the pool. Oversized buffers (>64 KiB) are discarded to prevent pool memory bloat.

Pre-sizing for collections with fixed-width elements:

  • List: 4 + n * (4 + elemSize)
  • Map: 4 + n * (4 + keySize + 4 + valSize) (only when both key and value are fixed-width)
  • Vector: unchanged (already pre-sized)

fixedElemSize coverage:

| Type | Bytes | Status |
| --- | --- | --- |
| Int, Float, Date | 4 | Date: new |
| BigInt, Double, Timestamp, Counter, Time | 8 | Counter, Time: new |
| UUID, TimeUUID | 16 | existing |

Note: Boolean/TinyInt (1B) and SmallInt (2B) are intentionally excluded — Cassandra's vector implementation treats them as variable-length.
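
The predicate and pre-sizing formula above can be sketched like this; the Type constants are local stand-ins for gocql's, and the helper names are illustrative:

```go
package main

import "fmt"

// Local stand-ins for gocql's Type constants (only the ones this
// sketch needs).
type Type int

const (
	TypeInt Type = iota
	TypeFloat
	TypeDate
	TypeBigInt
	TypeDouble
	TypeTimestamp
	TypeCounter
	TypeTime
	TypeUUID
	TypeTimeUUID
)

// fixedElemSize returns the wire size of a fixed-width element type,
// or 0 when the type is variable-width and cannot be pre-sized.
func fixedElemSize(t Type) int {
	switch t {
	case TypeInt, TypeFloat, TypeDate:
		return 4
	case TypeBigInt, TypeDouble, TypeTimestamp, TypeCounter, TypeTime:
		return 8
	case TypeUUID, TypeTimeUUID:
		return 16
	}
	return 0
}

// listSizeHint implements the list formula: 4 + n*(4 + elemSize).
func listSizeHint(n, elemSize int) int {
	return 4 + n*(4+elemSize)
}

func main() {
	// 1000-element int list: 4 + 1000*(4+4) = 8004 bytes.
	fmt.Println(listSizeHint(1000, fixedElemSize(TypeInt))) // 8004
}
```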

Phase 1 Testing

  • 20 new unit tests in marshal_buf_pool_test.go: pool infrastructure, fixedElemSize coverage, round-trip correctness for list/map/vector, wire format compatibility, concurrent safety (100 goroutines × 1000 iterations with -race)
  • 4 new benchmarks: list (int32, float32, bigint) and map marshal

Phase 1 Benchmark Results

In isolated single-goroutine benchmarks, the pool shows no measurable latency difference — the Go allocator's fast path handles &bytes.Buffer{} efficiently without contention. The pool benefit manifests under real workloads with concurrent marshal calls (reduced GC pressure, eliminated buffer re-growth).

Commit 3: Type-specialized vector fast paths

Design

Type-switch dispatch in marshalVector/unmarshalVector for 6 common CQL types, using encoding/binary.BigEndian directly:

  • marshalVectorFloat32 / unmarshalVectorFloat32
  • marshalVectorFloat64 / unmarshalVectorFloat64
  • marshalVectorInt32 / unmarshalVectorInt32
  • marshalVectorInt64 / unmarshalVectorInt64 (also used by TypeTimestamp)
  • marshalVectorUUID / unmarshalVectorUUID (also used by TypeTimeUUID)
  • marshalVectorCounter / unmarshalVectorCounter (length-prefixed wire format)

Key properties:

  • Marshal: Single allocation for output []byte, BCE-hinted inner loops, no per-element marshalData calls
  • Unmarshal: Zero allocations when destination slice has sufficient capacity (reuses existing slice via (*p)[:dim])
  • Overflow-safe sizing: vectorByteSize() checks for integer overflow before allocating
  • Counter vectors: Special-cased outside the isVectorVariableLengthType() guard because counters use a length-prefixed wire format (uVInt(8) + 8-byte big-endian payload) but can still bypass reflection
  • VectorType.NewWithError(): Returns concrete Go types (*[]float32, *[]time.Time, etc.) instead of falling back to reflection, consistent with goType()
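
A minimal sketch of the float32 pair, assuming fixed-size vector elements are written as raw 4-byte big-endian values with no per-element length prefix (counters being the exception noted above). Names follow the PR; exact signatures are assumptions:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// marshalVectorFloat32 writes each element as a 4-byte big-endian
// IEEE 754 value in a single output allocation.
func marshalVectorFloat32(v []float32) []byte {
	out := make([]byte, 4*len(v))
	for i, f := range v {
		binary.BigEndian.PutUint32(out[4*i:4*i+4], math.Float32bits(f))
	}
	return out
}

// unmarshalVectorFloat32 reuses the destination's backing array when
// capacity suffices, giving zero allocations in steady state.
func unmarshalVectorFloat32(data []byte, p *[]float32, dim int) error {
	if len(data) != 4*dim {
		return fmt.Errorf("vector: got %d bytes, want %d", len(data), 4*dim)
	}
	if cap(*p) >= dim {
		*p = (*p)[:dim]
	} else {
		*p = make([]float32, dim)
	}
	for i := range *p {
		(*p)[i] = math.Float32frombits(binary.BigEndian.Uint32(data[4*i : 4*i+4]))
	}
	return nil
}

func main() {
	in := []float32{1.5, -2.25, 3}
	wire := marshalVectorFloat32(in)
	var out []float32
	if err := unmarshalVectorFloat32(wire, &out, len(in)); err != nil {
		panic(err)
	}
	fmt.Println(out) // [1.5 -2.25 3]
}
```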

Vector Benchmark Results

-benchtime=3s -count=5 -cpu=4, median of 5 runs. Reflect-path baseline measured using named types that bypass the fast-path type-switch.

Float32 vectors (dim_768)

| Metric | Reflect path (before) | Fast path (after) | Improvement |
| --- | --- | --- | --- |
| Marshal ns/op | 27,299 | 1,165 | 23x faster |
| Marshal B/op | 9,241 | 3,096 | 66% less |
| Marshal allocs/op | 1,538 | 2 | 99.9% less |
| Unmarshal ns/op | 28,083 | 728 | 39x faster |
| Unmarshal B/op | 3,096 | 0 | 100% less |
| Unmarshal allocs/op | 2 | 0 | 100% less |

All vector types (dim_768, before → after)

| Type | Marshal reflect ns/op | Marshal fast ns/op | Speedup | Unmarshal reflect ns/op | Unmarshal fast ns/op | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Float32 | 27,299 | 1,165 | 23x | 28,083 | 728 | 39x |
| Float64 | 29,503 | 1,510 | 20x | 28,750 | 860 | 33x |
| Int32 | 30,061 | 1,034 | 29x | 30,109 | 728 | 41x |

All vector types (dim_768, memory: before → after)

| Type | Marshal B/op (reflect → fast) | Marshal allocs (reflect → fast) | Unmarshal B/op (reflect → fast) | Unmarshal allocs (reflect → fast) |
| --- | --- | --- | --- |
| Float32 | 9,241 → 3,096 | 1,538 → 2 | 3,096 → 0 | 2 → 0 |
| Float64 | 18,459 → 6,168 | 1,538 → 2 | 6,168 → 0 | 2 → 0 |
| Int32 | 9,241 → 3,096 | 1,538 → 2 | 3,096 → 0 | 2 → 0 |

All unmarshal fast paths achieve zero allocations by reusing the destination slice when capacity suffices.

Vector Testing

  • 46 new unit tests in vector_fastpath_test.go: round-trip correctness for all 6 types (including counter), boundary values (NaN, Inf, MaxFloat, min/max int), empty/nil vectors, zero-dimension handling, wrong-type fallback to reflect path, wire format byte-level verification, counter-specific tests (wire format, wrong element length, trailing bytes, slice reuse)
  • 12 new benchmarks: marshal + unmarshal for float32/float64/int32/int64/UUID at dims 128/768/1536
  • All tests pass with -race -count=1

Commit 4: Type-specialized list/set fast paths

Design

Type-switch dispatch in marshalList/unmarshalList for 4 common fixed-size CQL types, using encoding/binary.BigEndian directly. Lists and sets use the same wire format and code path.

List wire format: [4-byte count] + N × ([4-byte elem-length] + [elem-bytes])

Supported types: TypeFloat ([]float32), TypeDouble ([]float64), TypeInt ([]int32), TypeBigInt/TypeTimestamp/TypeCounter ([]int64)

Key properties:

  • Marshal: Single allocation for output []byte, writes count header + per-element length prefix + data in one pass, MaxInt32 overflow guard
  • Unmarshal: Element-by-element parsing with signed length interpretation (matching readCollectionSize), null element support (negative length → zero value), slice reuse when capacity suffices, per-element bounds checking
  • Overflow-safe sizing: listByteSize() checks for integer overflow before allocating
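
The int32 pair can be sketched over the wire format above: [4-byte count] + N × ([4-byte elem-length] + [elem-bytes]). Names follow the PR; error handling is abbreviated:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// marshalListInt32 writes the count header plus a fixed 4-byte length
// prefix and payload per element in a single output allocation.
func marshalListInt32(v []int32) []byte {
	out := make([]byte, 4+8*len(v))
	binary.BigEndian.PutUint32(out[0:4], uint32(len(v)))
	for i, x := range v {
		p := 4 + 8*i
		binary.BigEndian.PutUint32(out[p:p+4], 4) // element length
		binary.BigEndian.PutUint32(out[p+4:p+8], uint32(x))
	}
	return out
}

// unmarshalListInt32 interprets lengths as signed (as readCollectionSize
// does): a negative element length means a null element → zero value.
func unmarshalListInt32(data []byte, dst *[]int32) error {
	if len(data) < 4 {
		return fmt.Errorf("truncated count")
	}
	n := int(int32(binary.BigEndian.Uint32(data[0:4])))
	if n < 0 {
		return fmt.Errorf("negative count %d", n)
	}
	if cap(*dst) >= n {
		*dst = (*dst)[:n] // reuse destination slice when capacity suffices
	} else {
		*dst = make([]int32, n)
	}
	p := 4
	for i := 0; i < n; i++ {
		if len(data) < p+4 {
			return fmt.Errorf("truncated element length")
		}
		elen := int(int32(binary.BigEndian.Uint32(data[p : p+4])))
		p += 4
		if elen < 0 {
			(*dst)[i] = 0 // null element
			continue
		}
		if elen != 4 || len(data) < p+4 {
			return fmt.Errorf("bad element length %d", elen)
		}
		(*dst)[i] = int32(binary.BigEndian.Uint32(data[p : p+4]))
		p += 4
	}
	return nil
}

func main() {
	wire := marshalListInt32([]int32{7, -1, 42})
	var out []int32
	if err := unmarshalListInt32(wire, &out); err != nil {
		panic(err)
	}
	fmt.Println(out) // [7 -1 42]
}
```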

List Benchmark Results

-benchtime=3s -count=5 -cpu=4, median of 5 runs, n=1000 elements. Reflect-path baseline measured using named types that bypass the fast-path type-switch.

Int32 lists (n=1000)

| Metric | Reflect path (before) | Fast path (after) | Improvement |
| --- | --- | --- | --- |
| Marshal ns/op | 50,497 | 2,356 | 21x faster |
| Marshal B/op | 16,283 | 8,280 | 49% less |
| Marshal allocs/op | 2,003 | 3 | 99.9% less |
| Unmarshal ns/op | 40,540 | 1,911 | 21x faster |
| Unmarshal B/op | 4,184 | 64 | 98% less |
| Unmarshal allocs/op | 3 | 1 | 67% less |

All list types (n=1000, before → after)

| Type | Marshal reflect ns/op | Marshal fast ns/op | Speedup | Unmarshal reflect ns/op | Unmarshal fast ns/op | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Int32 | 50,497 | 2,356 | 21x | 40,540 | 1,911 | 21x |
| Float32 | 45,912 | 2,672 | 17x | 36,868 | 1,972 | 19x |
| Int64 | 50,969 | 2,543 | 20x | 41,236 | 1,815 | 23x |
| Float64 | 43,574 | 3,112 | 14x | 35,784 | 2,013 | 18x |

All list types (n=1000, memory: before → after)

| Type | Marshal B/op (reflect → fast) | Marshal allocs (reflect → fast) | Unmarshal B/op (reflect → fast) | Unmarshal allocs (reflect → fast) |
| --- | --- | --- | --- |
| Int32 | 16,283 → 8,280 | 2,003 → 3 | 4,184 → 64 | 3 → 1 |
| Float32 | 16,283 → 8,280 | 2,003 → 3 | 4,184 → 64 | 3 → 1 |
| Int64 | 28,381 → 12,376 | 2,003 → 3 | 8,280 → 64 | 3 → 1 |
| Float64 | 28,381 → 12,376 | 2,003 → 3 | 8,280 → 64 | 3 → 1 |

List Testing

  • 33 new unit tests in list_fastpath_test.go: round-trip correctness for all 4 types, null element handling (negative length prefix → zero value), empty/nil lists, slice capacity reuse, wire format byte-level compatibility, cross-path compatibility (fast marshal + reflect unmarshal), boundary values, special floats, set type coverage, TypeCounter and TypeTimestamp round-trips, overflow/truncated data/negative count error cases
  • 8 new benchmarks: marshal + unmarshal for each type at n=10/100/1000
  • All tests pass with -race -count=1

Commit 5: Marshal output pool (marshalOutputPool)

Design

Pools the []byte slices returned by all 10 type-specialized fast-path marshal functions. After the framer copies data via writeBytes (which does f.buf = append(f.buf, p...)), the connection layer returns these buffers to the pool.

Data flow:

marshalVectorFloat32(vec, dim)  →  getMarshalOutput(size)  →  []byte
  ↓
queryValues.value = []byte
  ↓
writeBytes(framer, value)       →  f.buf = append(f.buf, p...)  // copies
  ↓
defer putMarshalOutput(value)   →  back to marshalOutputPool

Pool infrastructure:

  • marshalOutputPool (sync.Pool): pools []byte slices
  • getMarshalOutput(size): returns a []byte of exactly the requested length, from pool if a buffer with sufficient capacity exists, otherwise freshly allocated
  • putMarshalOutput(buf): returns buffer to pool; discards nil and oversized (>64 KiB) buffers
  • pooledMarshalType(TypeInfo): predicate identifying types that use pooled fast paths
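
The get/put pair can be sketched as follows; names follow the PR, internals are assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

const maxPooledOutputSize = 64 * 1024

var marshalOutputPool sync.Pool // holds []byte values

// getMarshalOutput returns a []byte of exactly the requested length,
// reusing a pooled buffer when its capacity suffices.
func getMarshalOutput(size int) []byte {
	if v := marshalOutputPool.Get(); v != nil {
		if b := v.([]byte); cap(b) >= size {
			return b[:size] // steady-state: zero allocations
		}
	}
	return make([]byte, size)
}

// putMarshalOutput returns a buffer to the pool, discarding nil and
// oversized buffers.
func putMarshalOutput(buf []byte) {
	if buf == nil || cap(buf) > maxPooledOutputSize {
		return
	}
	marshalOutputPool.Put(buf[:0])
}

func main() {
	b := getMarshalOutput(128)
	putMarshalOutput(b)
	b2 := getMarshalOutput(64) // likely reuses b's backing array
	fmt.Println(len(b2), cap(b2) >= 64)
}
```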

Connection layer wiring:

| Path | Strategy |
| --- | --- |
| executeQuery | Scan column types for poolable types before the marshal loop; only install the defer when at least one pooled column is found (~50ns defer overhead avoided for non-pooled queries). The defer is installed before the marshal loop for error-path safety. |
| executeBatch | Unconditional defer with a pooledBufs [][]byte collection; each marshaled value is checked with pooledMarshalType and appended to the collection. Handles multi-statement batches with mixed types. |
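
The executeBatch strategy can be sketched as below. putMarshalOutput and pooledMarshalType are stand-ins for the PR's helpers (the real predicate takes a TypeInfo, not a string), and the type names are purely illustrative:

```go
package main

import "fmt"

func putMarshalOutput(b []byte) { /* would return b to marshalOutputPool */ }

// Illustrative string-based predicate; the PR's version inspects TypeInfo.
func pooledMarshalType(typ string) bool {
	switch typ {
	case "vector<float>", "vector<double>", "list<int>", "list<bigint>":
		return true
	}
	return false
}

// marshalBatchValues mirrors the pattern: collect pooled buffers while
// marshaling, then return them all in one deferred loop so mid-loop
// errors are covered too.
func marshalBatchValues(types []string, values [][]byte) int {
	var pooledBufs [][]byte
	defer func() {
		for _, b := range pooledBufs {
			// Safe: writeBytes copies each value into the framer first.
			putMarshalOutput(b)
		}
	}()
	for i, v := range values {
		// framer.writeBytes(v) would copy v here.
		if pooledMarshalType(types[i]) {
			pooledBufs = append(pooledBufs, v)
		}
	}
	return len(pooledBufs)
}

func main() {
	n := marshalBatchValues(
		[]string{"vector<float>", "text", "list<int>"},
		[][]byte{{1}, {2}, {3}})
	fmt.Println(n) // 2
}
```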

Pooled type coverage:

| Category | Pooled types |
| --- | --- |
| Vectors | float, double, int, bigint, timestamp, counter, uuid, timeuuid |
| Lists/Sets | float, double, int, bigint, timestamp, counter |

Steady-state allocation: After pool warm-up, fast-path marshal calls reuse existing buffers → 0 allocs for same-sized repeated queries (the common case for prepared statements).

Output Pool Testing

  • 9 new unit tests in marshal_buf_pool_test.go: fresh allocation, pool reuse, too-small pool buffer fallback, nil/oversized put safety, round-trip reuse, vector/list marshal pool integration, pooledMarshalType with 25 subcases covering all vector subtypes, list/set elem types, and negative cases (map, varchar, blob, boolean, native types)
  • All tests pass with -race -count=1

Additional fixes

  • Fixed VectorType.NewWithError() timestamp regression: was returning *[]int64 but the canonical goType() mapping is time.Time, so vector columns would have returned wrong Go types through Iter.RowData(). Fixed to return *[]time.Time.
  • Added TestVectorNewWithErrorConsistentWithGoType in marshal_test.go to guard against future NewWithError/goType mismatches.

Relation to PR #770

This PR is a full superset of #770. It includes all of #770's vector-specific optimizations (marshal/unmarshal fast paths, NewWithError fast path) plus the generalized buffer pool for lists/maps, list/set fast paths, and marshal output pooling in the connection layer. Once this PR is merged, #770 can be closed.

@mykaul force-pushed the generalize-marshal-buf-pool branch 4 times, most recently from c81589f to a7c78e2 on April 12, 2026 09:48
mykaul added 4 commits on April 12, 2026 16:31

Replace per-call &bytes.Buffer{} allocations in marshalList, marshalMap,
and marshalVector with a shared sync.Pool (marshalBufPool).

Key changes:
- Add marshalBufPool with getMarshalBuf/putMarshalBuf/finishMarshalBuf
  helper functions for safe pool lifecycle management
- finishMarshalBuf copies data out before returning the buffer to the
  pool, eliminating any risk of aliased-slice corruption
- Add pre-sizing via buf.Grow() in marshalList and marshalMap using
  fixedElemSize() to estimate buffer capacity for fixed-width CQL types
- Rename vectorFixedElemSize -> fixedElemSize and add missing types:
  TinyInt (1B), SmallInt (2B), Date (4B), Counter (8B), Time (8B)
- Discard oversized buffers (>64KiB) to prevent pool memory bloat
- All error paths properly return buffers to the pool

The pool eliminates repeated Buffer struct allocation (96 bytes) and
internal slice re-growth in steady state. The copy in finishMarshalBuf
is equivalent to what the old buf.Bytes() caller would have needed
anyway, so the net overhead is negligible (+0.06% to +0.16% B/op).

Benchmark results (6 iterations, benchstat):
  Marshal allocs/op: identical (no new allocations)
  Marshal B/op:      +0.06% to +0.16% (copy overhead)
  Unmarshal path:    completely unchanged

Includes 20 new unit tests covering pool infrastructure, fixedElemSize,
round-trip correctness (list/map/vector with various types including
edge cases), byte-level wire format compatibility, and concurrent
safety under the race detector.

Add direct encoding/binary fast paths for marshalVector and
unmarshalVector that bypass per-element reflection and Marshal/Unmarshal
calls for common fixed-size CQL types: float32, float64, int32, int64,
and UUID.

Key changes:
- marshalVectorFloat32/Float64/Int32/Int64/UUID: write directly to
  []byte using binary.BigEndian, with BCE hints for bounds elimination
- unmarshalVectorFloat32/Float64/Int32/Int64/UUID: read directly from
  []byte, reusing destination slice backing array when cap >= dim
- Type-switch dispatch in marshalVector/unmarshalVector tries fast paths
  before falling through to the existing reflection-based slow path
- VectorType.NewWithError() avoids expensive goType→asVectorType
  re-parse for common element types
- vectorByteSize() overflow check helper
- Comprehensive tests: round-trip, special values, nil/zero-dim, slice
  reuse, dimension mismatch, wire format verification, fallthrough
- Benchmarks for all fast-path types

Performance (768-dim float32 vector, representative):
  Marshal:   22,500 → 1,090 ns/op  (~21x faster), 1538→2 allocs
  Unmarshal: 17,700 →   650 ns/op  (~27x faster), 2→0 allocs

Add direct encoding/binary fast paths for marshalList/unmarshalList
with []int32, []int64, []float32, []float64 element types, bypassing
per-element reflection and the generic Marshal()/Unmarshal() calls.

List wire format: [4-byte count] + N × ([4-byte elem-length] + [elem-bytes])

Marshal fast paths: single allocation for output []byte, BCE-friendly
inner loops writing length prefix + element data directly.

Unmarshal fast paths: element-by-element parsing with signed length
interpretation (matching readCollectionSize), null element support
(negative length → zero value), and slice reuse when capacity suffices.

Coverage: TypeFloat, TypeDouble, TypeInt, TypeBigInt, TypeTimestamp,
TypeCounter — all types whose wire representation is a fixed-size
big-endian encoding.

33 new unit tests in list_fastpath_test.go:
- Round-trip correctness for all 4 types
- Null element handling (negative length prefix → zero value)
- Empty/nil list handling
- Slice capacity reuse
- Wire format byte-level compatibility
- Cross-path compatibility (fast marshal + reflect unmarshal)
- Boundary values, special floats (NaN, Inf)
- Set type coverage (same code path)
- TypeCounter and TypeTimestamp round-trips
- Overflow, truncated data, negative count error cases

8 new benchmarks (marshal + unmarshal × 4 types × 3 sizes).

Add marshalOutputPool (sync.Pool) to recycle []byte slices returned by
the 10 type-specialized marshal functions (vectors and lists/sets).
The connection layer (executeQuery, executeBatch) returns these buffers
to the pool after the framer copies them via writeBytes.

Key changes:
- getMarshalOutput/putMarshalOutput: pool management with cap guard
- pooledMarshalType: identifies types using pooled marshal fast paths
- executeQuery: scan columns for poolable types, install defer before
  marshal loop so buffers are returned even on mid-loop errors
- executeBatch: unconditional defer with pooledBufs collection, also
  handles error-path cleanup correctly
- All 10 fast-path marshal functions use getMarshalOutput instead of
  make([]byte, size)
- 9 new unit tests covering pool mechanics, round-trip reuse, and
  pooledMarshalType with 25 type coverage subcases
@mykaul force-pushed the generalize-marshal-buf-pool branch from f4b7316 to eb77cd7 on April 12, 2026 13:33