Commit c5fb0b2
committed
perf: optimize vector marshal/unmarshal for float32/float64/int32/int64
Add type-specialized fast paths for vector<float>, vector<double>,
vector<int>, and vector<bigint> that bypass reflect-based per-element
marshaling in favor of direct encoding/binary bulk conversion.
Changes in marshal.go:
- Type switches in marshalVector()/unmarshalVector() dispatch to
dedicated functions for []float32, []float64, []int32, []int64
before falling through to the generic reflect path.
- 8 new functions: marshalVectorFloat32, marshalVectorFloat64,
unmarshalVectorFloat32, unmarshalVectorFloat64, marshalVectorInt32,
marshalVectorInt64, unmarshalVectorInt32, unmarshalVectorInt64.
- sync.Pool buffer reuse (vectorBufPool/getVectorBuf/putVectorBuf)
for zero-alloc steady state when callers return buffers after
the framer copies them. 64KiB cap prevents pool bloat.
- Unmarshal fast paths reuse destination slice backing array when
capacity is sufficient (zero-alloc steady state on read path).
- Generic path preallocation via vectorFixedElemSize() + buf.Grow()
for non-fast-path fixed-size types (e.g. UUID, timestamp).
- vectorByteSize() helper guards against integer overflow on 32-bit
platforms with corrupt or adversarial schema metadata.
- All fast-path errors are wrapped as MarshalError/UnmarshalError
for consistent error typing.
- dim=0 vectors correctly encode as non-nil empty values (not CQL NULL)
in both fast paths and generic path.
- Negative dimensions are rejected with clear error messages.
Benchmark results for vector<float, 1536> (typical embedding dimension):
Marshal (baseline -> optimized):
86.4 us/op -> 3.4 us/op (25x faster)
3081 allocs -> 2 allocs (99.94% fewer)
28632 B/op -> 6172 B/op (78% less memory)
Marshal with pool return (steady state):
86.4 us/op -> 1.6 us/op (54x faster)
3081 allocs -> 2 allocs (99.94% fewer)
28632 B/op -> 48 B/op (99.8% less memory)
Unmarshal (baseline -> optimized):
60.2 us/op -> 1.5 us/op (41x faster)
2 allocs -> 0 allocs (100% fewer)
6168 B/op -> 0 B/op (100% less memory)
Round-trip (baseline -> optimized, pooled):
147.8 us/op -> 3.1 us/op (48x faster)
3083 allocs -> 2 allocs (99.94% fewer)
34800 B/op -> 48 B/op (99.9% less memory)
Throughput: 80 MB/s -> 3.5 GB/s (geomean, +2900%)
New test files:
- marshal_vector_test.go: 58+ unit subtests across 13 categories
(round-trip, byte-compat, slice-reuse, nil, dimension-mismatch,
empty-vector, pointer-to-slice, special-values, pool-concurrency,
oversized-not-pooled, fixed-elem-size, generic-prealloc).
- vector_bench_test.go: extended with int32/int64 and pooled benchmarks.
- tests/bench/bench_vector_public_test.go: public API benchmarks for
int32/int64 marshal/unmarshal.
Subsumes PR #744 (float fast paths) and PR #745 (generic prealloc).
Extends with int32/int64 fast paths and buffer pooling not covered by
any existing PR.1 parent 04c41b0 commit c5fb0b2
4 files changed
Lines changed: 2458 additions & 1 deletion
0 commit comments