perf: generalize buffer pool for collection/vector marshal#839

Draft
mykaul wants to merge 4 commits into scylladb:master from mykaul:generalize-marshal-buf-pool

Conversation


@mykaul commented Apr 10, 2026

Summary

This PR delivers four complementary performance optimizations for CQL collection/vector marshal/unmarshal:

  1. Generalized buffer pool (Commit 2): Replace per-call &bytes.Buffer{} allocations in marshalList, marshalMap, and marshalVector with a shared sync.Pool, adding pre-sizing for fixed-width element types
  2. Type-specialized vector fast paths (Commit 3): Add direct encoding/binary marshal/unmarshal for []float32, []float64, []int32, []int64, []UUID, and []int64 (counter) vectors, bypassing the per-element reflect loop entirely
  3. Type-specialized list/set fast paths (Commit 4): Add direct encoding/binary marshal/unmarshal for []int32, []int64, []float32, []float64 lists and sets, bypassing per-element reflect + Marshal/Unmarshal
  4. Marshal output pool (Commit 5): Pool the []byte slices returned by all 10 fast-path marshal functions. The connection layer (executeQuery/executeBatch) returns these buffers to the pool after writeBytes copies them into the framer, achieving near-zero steady-state allocation for hot paths.

This PR is a full superset of #770 — it includes all vector-specific optimizations from that PR plus the generalized buffer pool, list/set fast paths, and marshal output pooling.

Commit 1: Fix pre-existing build failure

Fixes session_unit_test.go, where 5 hostId fields still used string literals after hostId was changed from string to UUID in d93b010.

Commit 2: Generalized buffer pool

Design

Pool lifecycle (no aliasing risk):

getMarshalBuf(sizeHint) → pooled *bytes.Buffer, reset + pre-grown
  ↓ write serialized data
finishMarshalBuf(buf)   → make([]byte, len) + copy; putMarshalBuf(buf)
  ↓ return owned []byte
caller gets a slice that does NOT alias pooled storage

finishMarshalBuf copies the data out before returning the buffer to the pool. Oversized buffers (>64 KiB) are discarded to prevent pool memory bloat.

Pre-sizing for collections with fixed-width elements:

  • List: 4 + n * (4 + elemSize)
  • Map: 4 + n * (4 + keySize + 4 + valSize) (only when both key and value are fixed-width)
  • Vector: unchanged (already pre-sized)

fixedElemSize coverage:

| Type | Bytes | Status |
| --- | --- | --- |
| Int, Float, Date | 4 | Date: new |
| BigInt, Double, Timestamp, Counter, Time | 8 | Counter, Time: new |
| UUID, TimeUUID | 16 | existing |

Note: Boolean/TinyInt (1B) and SmallInt (2B) are intentionally excluded — Cassandra's vector implementation treats them as variable-length.
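
The predicate and pre-sizing formula above can be sketched like this; the Type constants are local stand-ins for gocql's, and the helper names are illustrative:

```go
package main

import "fmt"

// Local stand-ins for gocql's Type constants (only the ones this
// sketch needs).
type Type int

const (
	TypeInt Type = iota
	TypeFloat
	TypeDate
	TypeBigInt
	TypeDouble
	TypeTimestamp
	TypeCounter
	TypeTime
	TypeUUID
	TypeTimeUUID
)

// fixedElemSize returns the wire size of a fixed-width element type,
// or 0 when the type is variable-width and cannot be pre-sized.
func fixedElemSize(t Type) int {
	switch t {
	case TypeInt, TypeFloat, TypeDate:
		return 4
	case TypeBigInt, TypeDouble, TypeTimestamp, TypeCounter, TypeTime:
		return 8
	case TypeUUID, TypeTimeUUID:
		return 16
	}
	return 0
}

// listSizeHint implements the list formula: 4 + n*(4 + elemSize).
func listSizeHint(n, elemSize int) int {
	return 4 + n*(4+elemSize)
}

func main() {
	// 1000-element int list: 4 + 1000*(4+4) = 8004 bytes.
	fmt.Println(listSizeHint(1000, fixedElemSize(TypeInt))) // 8004
}
```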

Phase 1 Testing

  • 20 new unit tests in marshal_buf_pool_test.go: pool infrastructure, fixedElemSize coverage, round-trip correctness for list/map/vector, wire format compatibility, concurrent safety (100 goroutines × 1000 iterations with -race)
  • 4 new benchmarks: list (int32, float32, bigint) and map marshal

Phase 1 Benchmark Results

In isolated single-goroutine benchmarks, the pool shows no measurable latency difference — the Go allocator's fast path handles &bytes.Buffer{} efficiently without contention. The pool benefit manifests under real workloads with concurrent marshal calls (reduced GC pressure, eliminated buffer re-growth).

Commit 3: Type-specialized vector fast paths

Design

Type-switch dispatch in marshalVector/unmarshalVector for 6 common CQL types, using encoding/binary.BigEndian directly:

  • marshalVectorFloat32 / unmarshalVectorFloat32
  • marshalVectorFloat64 / unmarshalVectorFloat64
  • marshalVectorInt32 / unmarshalVectorInt32
  • marshalVectorInt64 / unmarshalVectorInt64 (also used by TypeTimestamp)
  • marshalVectorUUID / unmarshalVectorUUID (also used by TypeTimeUUID)
  • marshalVectorCounter / unmarshalVectorCounter (length-prefixed wire format)

Key properties:

  • Marshal: Single allocation for output []byte, BCE-hinted inner loops, no per-element marshalData calls
  • Unmarshal: Zero allocations when destination slice has sufficient capacity (reuses existing slice via (*p)[:dim])
  • Overflow-safe sizing: vectorByteSize() checks for integer overflow before allocating
  • Counter vectors: Special-cased outside the isVectorVariableLengthType() guard because counters use a length-prefixed wire format (uVInt(8) + 8-byte big-endian payload) but can still bypass reflection
  • VectorType.NewWithError(): Returns concrete Go types (*[]float32, *[]time.Time, etc.) instead of falling back to reflection, consistent with goType()
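
A minimal sketch of the float32 pair, assuming fixed-size vector elements are written as raw 4-byte big-endian values with no per-element length prefix (counters being the exception noted above). Names follow the PR; exact signatures are assumptions:

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// marshalVectorFloat32 writes each element as a 4-byte big-endian
// IEEE 754 value in a single output allocation.
func marshalVectorFloat32(v []float32) []byte {
	out := make([]byte, 4*len(v))
	for i, f := range v {
		binary.BigEndian.PutUint32(out[4*i:4*i+4], math.Float32bits(f))
	}
	return out
}

// unmarshalVectorFloat32 reuses the destination's backing array when
// capacity suffices, giving zero allocations in steady state.
func unmarshalVectorFloat32(data []byte, p *[]float32, dim int) error {
	if len(data) != 4*dim {
		return fmt.Errorf("vector: got %d bytes, want %d", len(data), 4*dim)
	}
	if cap(*p) >= dim {
		*p = (*p)[:dim]
	} else {
		*p = make([]float32, dim)
	}
	for i := range *p {
		(*p)[i] = math.Float32frombits(binary.BigEndian.Uint32(data[4*i : 4*i+4]))
	}
	return nil
}

func main() {
	in := []float32{1.5, -2.25, 3}
	wire := marshalVectorFloat32(in)
	var out []float32
	if err := unmarshalVectorFloat32(wire, &out, len(in)); err != nil {
		panic(err)
	}
	fmt.Println(out) // [1.5 -2.25 3]
}
```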

Vector Benchmark Results

-benchtime=3s -count=5 -cpu=4, median of 5 runs. Reflect-path baseline measured using named types that bypass the fast-path type-switch.

Float32 vectors (dim_768)

| Metric | Reflect path (before) | Fast path (after) | Improvement |
| --- | --- | --- | --- |
| Marshal ns/op | 27,299 | 1,165 | 23x faster |
| Marshal B/op | 9,241 | 3,096 | 66% less |
| Marshal allocs/op | 1,538 | 2 | 99.9% less |
| Unmarshal ns/op | 28,083 | 728 | 39x faster |
| Unmarshal B/op | 3,096 | 0 | 100% less |
| Unmarshal allocs/op | 2 | 0 | 100% less |

All vector types (dim_768, before → after)

| Type | Marshal reflect ns/op | Marshal fast ns/op | Speedup | Unmarshal reflect ns/op | Unmarshal fast ns/op | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Float32 | 27,299 | 1,165 | 23x | 28,083 | 728 | 39x |
| Float64 | 29,503 | 1,510 | 20x | 28,750 | 860 | 33x |
| Int32 | 30,061 | 1,034 | 29x | 30,109 | 728 | 41x |

All vector types (dim_768, memory: before → after)

| Type | Marshal B/op (reflect → fast) | Marshal allocs (reflect → fast) | Unmarshal B/op (reflect → fast) | Unmarshal allocs (reflect → fast) |
| --- | --- | --- | --- |
| Float32 | 9,241 → 3,096 | 1,538 → 2 | 3,096 → 0 | 2 → 0 |
| Float64 | 18,459 → 6,168 | 1,538 → 2 | 6,168 → 0 | 2 → 0 |
| Int32 | 9,241 → 3,096 | 1,538 → 2 | 3,096 → 0 | 2 → 0 |

All unmarshal fast paths achieve zero allocations by reusing the destination slice when capacity suffices.

Vector Testing

  • 46 new unit tests in vector_fastpath_test.go: round-trip correctness for all 6 types (including counter), boundary values (NaN, Inf, MaxFloat, min/max int), empty/nil vectors, zero-dimension handling, wrong-type fallback to reflect path, wire format byte-level verification, counter-specific tests (wire format, wrong element length, trailing bytes, slice reuse)
  • 12 new benchmarks: marshal + unmarshal for float32/float64/int32/int64/UUID at dims 128/768/1536
  • All tests pass with -race -count=1

Commit 4: Type-specialized list/set fast paths

Design

Type-switch dispatch in marshalList/unmarshalList for 4 common fixed-size CQL types, using encoding/binary.BigEndian directly. Lists and sets use the same wire format and code path.

List wire format: [4-byte count] + N × ([4-byte elem-length] + [elem-bytes])

Supported types: TypeFloat ([]float32), TypeDouble ([]float64), TypeInt ([]int32), TypeBigInt/TypeTimestamp/TypeCounter ([]int64)

Key properties:

  • Marshal: Single allocation for output []byte, writes count header + per-element length prefix + data in one pass, MaxInt32 overflow guard
  • Unmarshal: Element-by-element parsing with signed length interpretation (matching readCollectionSize), null element support (negative length → zero value), slice reuse when capacity suffices, per-element bounds checking
  • Overflow-safe sizing: listByteSize() checks for integer overflow before allocating
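
The int32 pair can be sketched over the wire format above: [4-byte count] + N × ([4-byte elem-length] + [elem-bytes]). Names follow the PR; error handling is abbreviated:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// marshalListInt32 writes the count header plus a fixed 4-byte length
// prefix and payload per element in a single output allocation.
func marshalListInt32(v []int32) []byte {
	out := make([]byte, 4+8*len(v))
	binary.BigEndian.PutUint32(out[0:4], uint32(len(v)))
	for i, x := range v {
		p := 4 + 8*i
		binary.BigEndian.PutUint32(out[p:p+4], 4) // element length
		binary.BigEndian.PutUint32(out[p+4:p+8], uint32(x))
	}
	return out
}

// unmarshalListInt32 interprets lengths as signed (as readCollectionSize
// does): a negative element length means a null element → zero value.
func unmarshalListInt32(data []byte, dst *[]int32) error {
	if len(data) < 4 {
		return fmt.Errorf("truncated count")
	}
	n := int(int32(binary.BigEndian.Uint32(data[0:4])))
	if n < 0 {
		return fmt.Errorf("negative count %d", n)
	}
	if cap(*dst) >= n {
		*dst = (*dst)[:n] // reuse destination slice when capacity suffices
	} else {
		*dst = make([]int32, n)
	}
	p := 4
	for i := 0; i < n; i++ {
		if len(data) < p+4 {
			return fmt.Errorf("truncated element length")
		}
		elen := int(int32(binary.BigEndian.Uint32(data[p : p+4])))
		p += 4
		if elen < 0 {
			(*dst)[i] = 0 // null element
			continue
		}
		if elen != 4 || len(data) < p+4 {
			return fmt.Errorf("bad element length %d", elen)
		}
		(*dst)[i] = int32(binary.BigEndian.Uint32(data[p : p+4]))
		p += 4
	}
	return nil
}

func main() {
	wire := marshalListInt32([]int32{7, -1, 42})
	var out []int32
	if err := unmarshalListInt32(wire, &out); err != nil {
		panic(err)
	}
	fmt.Println(out) // [7 -1 42]
}
```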

List Benchmark Results

-benchtime=3s -count=5 -cpu=4, median of 5 runs, n=1000 elements. Reflect-path baseline measured using named types that bypass the fast-path type-switch.

Int32 lists (n=1000)

| Metric | Reflect path (before) | Fast path (after) | Improvement |
| --- | --- | --- | --- |
| Marshal ns/op | 50,497 | 2,356 | 21x faster |
| Marshal B/op | 16,283 | 8,280 | 49% less |
| Marshal allocs/op | 2,003 | 3 | 99.9% less |
| Unmarshal ns/op | 40,540 | 1,911 | 21x faster |
| Unmarshal B/op | 4,184 | 64 | 98% less |
| Unmarshal allocs/op | 3 | 1 | 67% less |

All list types (n=1000, before → after)

| Type | Marshal reflect ns/op | Marshal fast ns/op | Speedup | Unmarshal reflect ns/op | Unmarshal fast ns/op | Speedup |
| --- | --- | --- | --- | --- | --- | --- |
| Int32 | 50,497 | 2,356 | 21x | 40,540 | 1,911 | 21x |
| Float32 | 45,912 | 2,672 | 17x | 36,868 | 1,972 | 19x |
| Int64 | 50,969 | 2,543 | 20x | 41,236 | 1,815 | 23x |
| Float64 | 43,574 | 3,112 | 14x | 35,784 | 2,013 | 18x |

All list types (n=1000, memory: before → after)

| Type | Marshal B/op (reflect → fast) | Marshal allocs (reflect → fast) | Unmarshal B/op (reflect → fast) | Unmarshal allocs (reflect → fast) |
| --- | --- | --- | --- |
| Int32 | 16,283 → 8,280 | 2,003 → 3 | 4,184 → 64 | 3 → 1 |
| Float32 | 16,283 → 8,280 | 2,003 → 3 | 4,184 → 64 | 3 → 1 |
| Int64 | 28,381 → 12,376 | 2,003 → 3 | 8,280 → 64 | 3 → 1 |
| Float64 | 28,381 → 12,376 | 2,003 → 3 | 8,280 → 64 | 3 → 1 |

List Testing

  • 33 new unit tests in list_fastpath_test.go: round-trip correctness for all 4 types, null element handling (negative length prefix → zero value), empty/nil lists, slice capacity reuse, wire format byte-level compatibility, cross-path compatibility (fast marshal + reflect unmarshal), boundary values, special floats, set type coverage, TypeCounter and TypeTimestamp round-trips, overflow/truncated data/negative count error cases
  • 8 new benchmarks: marshal + unmarshal for each type at n=10/100/1000
  • All tests pass with -race -count=1

Commit 5: Marshal output pool (marshalOutputPool)

Design

Pools the []byte slices returned by all 10 type-specialized fast-path marshal functions. After the framer copies data via writeBytes (which does f.buf = append(f.buf, p...)), the connection layer returns these buffers to the pool.

Data flow:

marshalVectorFloat32(vec, dim)  →  getMarshalOutput(size)  →  []byte
  ↓
queryValues.value = []byte
  ↓
writeBytes(framer, value)       →  f.buf = append(f.buf, p...)  // copies
  ↓
defer putMarshalOutput(value)   →  back to marshalOutputPool

Pool infrastructure:

  • marshalOutputPool (sync.Pool): pools []byte slices
  • getMarshalOutput(size): returns a []byte of exactly the requested length, from pool if a buffer with sufficient capacity exists, otherwise freshly allocated
  • putMarshalOutput(buf): returns buffer to pool; discards nil and oversized (>64 KiB) buffers
  • pooledMarshalType(TypeInfo): predicate identifying types that use pooled fast paths
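
The get/put pair can be sketched as follows; names follow the PR, internals are assumptions:

```go
package main

import (
	"fmt"
	"sync"
)

const maxPooledOutputSize = 64 * 1024

var marshalOutputPool sync.Pool // holds []byte values

// getMarshalOutput returns a []byte of exactly the requested length,
// reusing a pooled buffer when its capacity suffices.
func getMarshalOutput(size int) []byte {
	if v := marshalOutputPool.Get(); v != nil {
		if b := v.([]byte); cap(b) >= size {
			return b[:size] // steady-state: zero allocations
		}
	}
	return make([]byte, size)
}

// putMarshalOutput returns a buffer to the pool, discarding nil and
// oversized buffers.
func putMarshalOutput(buf []byte) {
	if buf == nil || cap(buf) > maxPooledOutputSize {
		return
	}
	marshalOutputPool.Put(buf[:0])
}

func main() {
	b := getMarshalOutput(128)
	putMarshalOutput(b)
	b2 := getMarshalOutput(64) // likely reuses b's backing array
	fmt.Println(len(b2), cap(b2) >= 64)
}
```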

Connection layer wiring:

| Path | Strategy |
| --- | --- |
| executeQuery | Scan column types for poolable types before the marshal loop; only install the defer when at least one pooled column is found (~50ns defer overhead avoided for non-pooled queries). The defer is installed before the marshal loop for error-path safety. |
| executeBatch | Unconditional defer with a pooledBufs [][]byte collection; each marshaled value is checked with pooledMarshalType and appended to the collection. Handles multi-statement batches with mixed types. |
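
The executeBatch strategy can be sketched as below. putMarshalOutput and pooledMarshalType are stand-ins for the PR's helpers (the real predicate takes a TypeInfo, not a string), and the type names are purely illustrative:

```go
package main

import "fmt"

func putMarshalOutput(b []byte) { /* would return b to marshalOutputPool */ }

// Illustrative string-based predicate; the PR's version inspects TypeInfo.
func pooledMarshalType(typ string) bool {
	switch typ {
	case "vector<float>", "vector<double>", "list<int>", "list<bigint>":
		return true
	}
	return false
}

// marshalBatchValues mirrors the pattern: collect pooled buffers while
// marshaling, then return them all in one deferred loop so mid-loop
// errors are covered too.
func marshalBatchValues(types []string, values [][]byte) int {
	var pooledBufs [][]byte
	defer func() {
		for _, b := range pooledBufs {
			// Safe: writeBytes copies each value into the framer first.
			putMarshalOutput(b)
		}
	}()
	for i, v := range values {
		// framer.writeBytes(v) would copy v here.
		if pooledMarshalType(types[i]) {
			pooledBufs = append(pooledBufs, v)
		}
	}
	return len(pooledBufs)
}

func main() {
	n := marshalBatchValues(
		[]string{"vector<float>", "text", "list<int>"},
		[][]byte{{1}, {2}, {3}})
	fmt.Println(n) // 2
}
```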

Pooled type coverage:

| Category | Pooled types |
| --- | --- |
| Vectors | float, double, int, bigint, timestamp, counter, uuid, timeuuid |
| Lists/Sets | float, double, int, bigint, timestamp, counter |

Steady-state allocation: After pool warm-up, fast-path marshal calls reuse existing buffers → 0 allocs for same-sized repeated queries (the common case for prepared statements).

Output Pool Testing

  • 9 new unit tests in marshal_buf_pool_test.go: fresh allocation, pool reuse, too-small pool buffer fallback, nil/oversized put safety, round-trip reuse, vector/list marshal pool integration, pooledMarshalType with 25 subcases covering all vector subtypes, list/set elem types, and negative cases (map, varchar, blob, boolean, native types)
  • All tests pass with -race -count=1

Additional fixes

  • Fixed VectorType.NewWithError() timestamp regression: was returning *[]int64 but the canonical goType() mapping is time.Time, so vector columns would have returned wrong Go types through Iter.RowData(). Fixed to return *[]time.Time.
  • Added TestVectorNewWithErrorConsistentWithGoType in marshal_test.go to guard against future NewWithError/goType mismatches.

Relation to PR #770

This PR is a full superset of #770. It includes all of #770's vector-specific optimizations (marshal/unmarshal fast paths, NewWithError fast path) plus the generalized buffer pool for lists/maps, list/set fast paths, and marshal output pooling in the connection layer. Once this PR is merged, #770 can be closed.

@mykaul force-pushed the generalize-marshal-buf-pool branch 4 times, most recently from c81589f to a7c78e2 on April 12, 2026 09:48
mykaul added 4 commits on April 12, 2026 16:31

Replace per-call &bytes.Buffer{} allocations in marshalList, marshalMap,
and marshalVector with a shared sync.Pool (marshalBufPool).

Key changes:
- Add marshalBufPool with getMarshalBuf/putMarshalBuf/finishMarshalBuf
  helper functions for safe pool lifecycle management
- finishMarshalBuf copies data out before returning the buffer to the
  pool, eliminating any risk of aliased-slice corruption
- Add pre-sizing via buf.Grow() in marshalList and marshalMap using
  fixedElemSize() to estimate buffer capacity for fixed-width CQL types
- Rename vectorFixedElemSize -> fixedElemSize and add missing types:
  TinyInt (1B), SmallInt (2B), Date (4B), Counter (8B), Time (8B)
- Discard oversized buffers (>64KiB) to prevent pool memory bloat
- All error paths properly return buffers to the pool

The pool eliminates repeated Buffer struct allocation (96 bytes) and
internal slice re-growth in steady state. The copy in finishMarshalBuf
is equivalent to what the old buf.Bytes() caller would have needed
anyway, so the net overhead is negligible (+0.06% to +0.16% B/op).

Benchmark results (6 iterations, benchstat):
  Marshal allocs/op: identical (no new allocations)
  Marshal B/op:      +0.06% to +0.16% (copy overhead)
  Unmarshal path:    completely unchanged

Includes 20 new unit tests covering pool infrastructure, fixedElemSize,
round-trip correctness (list/map/vector with various types including
edge cases), byte-level wire format compatibility, and concurrent
safety under the race detector.

Add direct encoding/binary fast paths for marshalVector and
unmarshalVector that bypass per-element reflection and Marshal/Unmarshal
calls for common fixed-size CQL types: float32, float64, int32, int64,
and UUID.

Key changes:
- marshalVectorFloat32/Float64/Int32/Int64/UUID: write directly to
  []byte using binary.BigEndian, with BCE hints for bounds elimination
- unmarshalVectorFloat32/Float64/Int32/Int64/UUID: read directly from
  []byte, reusing destination slice backing array when cap >= dim
- Type-switch dispatch in marshalVector/unmarshalVector tries fast paths
  before falling through to the existing reflection-based slow path
- VectorType.NewWithError() avoids expensive goType→asVectorType
  re-parse for common element types
- vectorByteSize() overflow check helper
- Comprehensive tests: round-trip, special values, nil/zero-dim, slice
  reuse, dimension mismatch, wire format verification, fallthrough
- Benchmarks for all fast-path types

Performance (768-dim float32 vector, representative):
  Marshal:   22,500 → 1,090 ns/op  (~21x faster), 1538→2 allocs
  Unmarshal: 17,700 →   650 ns/op  (~27x faster), 2→0 allocs

Add direct encoding/binary fast paths for marshalList/unmarshalList
with []int32, []int64, []float32, []float64 element types, bypassing
per-element reflection and the generic Marshal()/Unmarshal() calls.

List wire format: [4-byte count] + N × ([4-byte elem-length] + [elem-bytes])

Marshal fast paths: single allocation for output []byte, BCE-friendly
inner loops writing length prefix + element data directly.

Unmarshal fast paths: element-by-element parsing with signed length
interpretation (matching readCollectionSize), null element support
(negative length → zero value), and slice reuse when capacity suffices.

Coverage: TypeFloat, TypeDouble, TypeInt, TypeBigInt, TypeTimestamp,
TypeCounter — all types whose wire representation is a fixed-size
big-endian encoding.

33 new unit tests in list_fastpath_test.go:
- Round-trip correctness for all 4 types
- Null element handling (negative length prefix → zero value)
- Empty/nil list handling
- Slice capacity reuse
- Wire format byte-level compatibility
- Cross-path compatibility (fast marshal + reflect unmarshal)
- Boundary values, special floats (NaN, Inf)
- Set type coverage (same code path)
- TypeCounter and TypeTimestamp round-trips
- Overflow, truncated data, negative count error cases

8 new benchmarks (marshal + unmarshal × 4 types × 3 sizes).

Add marshalOutputPool (sync.Pool) to recycle []byte slices returned by
the 10 type-specialized marshal functions (vectors and lists/sets).
The connection layer (executeQuery, executeBatch) returns these buffers
to the pool after the framer copies them via writeBytes.

Key changes:
- getMarshalOutput/putMarshalOutput: pool management with cap guard
- pooledMarshalType: identifies types using pooled marshal fast paths
- executeQuery: scan columns for poolable types, install defer before
  marshal loop so buffers are returned even on mid-loop errors
- executeBatch: unconditional defer with pooledBufs collection, also
  handles error-path cleanup correctly
- All 10 fast-path marshal functions use getMarshalOutput instead of
  make([]byte, size)
- 9 new unit tests covering pool mechanics, round-trip reuse, and
  pooledMarshalType with 25 type coverage subcases
@mykaul force-pushed the generalize-marshal-buf-pool branch from f4b7316 to eb77cd7 on April 12, 2026 13:33