(Improvement) receive path performance improvements - NOT for MERGE #699
mykaul wants to merge 11 commits into scylladb:master
Conversation
Results:
And of course, that was not as straightforward as hoped - it needs fixing for older Scylla releases. Will do.
Fixed!
Pull request overview
This pull request introduces performance optimizations for the receive and parsing path in the GoCQL driver, with a focus on reducing allocations and improving efficiency when reading query results.
Changes:
- Added fast-path optimizations for common type allocations in NewWithError() methods, avoiding reflection overhead for frequently used types
- Optimized RowData() to pre-allocate slices with the correct size and use direct indexing instead of append operations
- Improved frame reading performance through single-call header reads, a larger default buffer size (128 → 4096 bytes), early returns for simple types, and "happy path first" restructuring of buffer read functions
- Introduced a protocol-level optimization that skips redundant keyspace/table reads when FlagGlobalTableSpec is not set, assuming all columns share the same keyspace/table values
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| marshal.go | Adds fast-path type allocations for NativeType, CollectionType, and TupleTypeInfo to avoid reflection overhead for common types |
| helpers_bench_test.go | New benchmark suite for measuring RowData() performance with various column counts and types |
| helpers.go | Optimizes RowData() by pre-sizing slices and using direct indexing instead of append |
| frame.go | Multiple optimizations: single-call header reading, larger buffer size, early return for simple types, happy-path-first buffer reads, skipString() helper, and keyspace/table read optimization |
| conn.go | Conditional time tracking to avoid time.Now() calls when no frameObserver is present, and refactored stream handling |
1. No need to call time.Now() twice if there's no frameObserver configured. By default, there isn't one.
2. Streamline the 'if' checks for the stream: if it's <= 0, build a frame; if it's -1, it's an event message. At least one fewer 'if' in the common case.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
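The conditional time-tracking pattern from this commit can be sketched as below. The names `FrameHeaderObserver`, `ObserveFrameHeader`, and `readFrame` are illustrative stand-ins, not the driver's exact API:

```go
package main

import (
	"fmt"
	"time"
)

// FrameHeaderObserver is a stand-in for the driver's frame observer
// interface; the method name here is illustrative.
type FrameHeaderObserver interface {
	ObserveFrameHeader(start, end time.Time)
}

// readFrame sketches the pattern: only call time.Now() (twice) when an
// observer is actually configured, since by default there is none.
func readFrame(obs FrameHeaderObserver, read func() error) error {
	var start time.Time
	if obs != nil {
		start = time.Now() // only pay for the clock read when observed
	}
	err := read()
	if obs != nil {
		obs.ObserveFrameHeader(start, time.Now())
	}
	return err
}

func main() {
	err := readFrame(nil, func() error { return nil })
	fmt.Println(err == nil)
}
```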
1. We can read the whole header at once - 9 bytes. There's no point in reading just 1 byte and then the rest: one fewer call to io.ReadFull().
2. Parsing and checking the version field can be done once.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
128 bytes is really small - most results are likely larger. 4K is reasonable (and happens to be the same size as in the Python driver). Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Use pre-sizing and direct indexing in RowData() to improve
performance and reduce allocations, especially for queries with tuples.
Changes:
- Pre-size slices using iter.meta.actualColCount instead of len(iter.Columns())
to account for tuple expansion, eliminating reallocation
- Replace append operations with direct slice indexing to avoid bounds
checking and length update overhead
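The pre-sizing and direct-indexing change can be sketched as below; `rowValues`, its parameters, and `actualColCount` as a plain argument are illustrative stand-ins for the real `RowData()` internals:

```go
package main

import "fmt"

// rowValues sketches the optimization: allocate the values slice at its
// final size up front (actualColCount accounts for tuple expansion) and
// fill it by index, instead of growing it with append.
func rowValues(actualColCount int, newValue func(i int) interface{}) []interface{} {
	values := make([]interface{}, actualColCount) // pre-sized: no reallocation
	for i := range values {
		values[i] = newValue(i) // direct indexing: no append bookkeeping
	}
	return values
}

func main() {
	vals := rowValues(3, func(i int) interface{} { return new(int) })
	fmt.Println(len(vals), cap(vals))
}
```

Because the slice is allocated at its final length, capacity equals length and no intermediate backing arrays are created, which is where the saved allocations come from.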
Performance improvements (measured on Intel i7-1270P, single core):
- Regular columns: 2-4% faster across all column counts
- Tuple columns: 12-13% faster with 128 bytes less memory and 2 fewer
allocations per RowData() call
Benchmark results:
                                        Baseline            Optimized           Improvement
BenchmarkRowData                        1122 ns/op          1092 ns/op          2.7% faster
BenchmarkRowDataWithTuples              1556 ns/op          1369 ns/op          12.0% faster
                                        616 B, 20 allocs    488 B, 18 allocs
BenchmarkRowDataAllocation/100cols      10580 ns/op         10130 ns/op         4.3% faster
BenchmarkRowDataAllocation/WithTuples   1602 ns/op          1390 ns/op          13.2% faster
                                        616 B, 20 allocs    488 B, 18 allocs
Added comprehensive benchmark suite in helpers_bench_test.go to measure
RowData() performance across various column counts and data types.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Add fast-path type instantiation for all native CQL types to avoid
expensive reflection calls when creating column value holders.
Changes:
- Added type switch in NativeType.NewWithError() with direct instantiation
for all 17 native types (int, bigint, text, uuid, timestamp, etc.)
- Falls back to reflection only for TypeCustom and complex types
- Added required imports: time, math/big, gopkg.in/inf.v0
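The fast-path shape is a type switch that returns concrete pointers for the common native types and leaves reflection for everything else. The sketch below is a reduced illustration - the `Type` constants and `newValue` helper are stand-ins, and the real switch covers all 17 native types:

```go
package main

import (
	"fmt"
	"time"
)

// Type mirrors the idea of the driver's CQL type ids; values are illustrative.
type Type int

const (
	TypeInt Type = iota
	TypeBigInt
	TypeText
	TypeTimestamp
)

// newValue sketches the fast path: direct instantiation for common native
// types, with a fallback (elided here) to reflect.New for everything else.
func newValue(t Type) (interface{}, bool) {
	switch t {
	case TypeInt:
		return new(int), true
	case TypeBigInt:
		return new(int64), true
	case TypeText:
		return new(string), true
	case TypeTimestamp:
		return new(time.Time), true
	}
	return nil, false // caller falls back to reflection
}

func main() {
	v, ok := newValue(TypeText)
	_, isStr := v.(*string)
	fmt.Println(ok, isStr)
}
```

A `new(string)` expression compiles to a plain allocation of a known type, whereas `reflect.New(typ).Interface()` goes through the reflection runtime, which is why eliminating it pays off on every column.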
Performance improvements (measured on Intel i7-1270P, single core):
Combined with previous pre-sizing/indexing optimization, achieves:
- 35-66% faster RowData() across all workloads
- 40-47% less memory allocation
- 35-50% fewer allocations
The reflection elimination provides 50-65% speedup on top of the
pre-sizing optimization, with improvement scaling with column count.
Benchmark comparison (baseline → optimized):
BenchmarkRowData 1122 ns/op → 463 ns/op (58.7% faster)
720 B, 22 allocs → 400 B, 12 allocs
BenchmarkRowDataLarge 5220 ns/op → 1865 ns/op (64.3% faster)
3792 B, 102 allocs → 2192 B, 52 allocs
BenchmarkRowDataAllocation/100cols 10580 ns/op → 3624 ns/op (65.7% faster)
7584 B, 202 allocs → 4384 B, 102 allocs
BenchmarkRowDataAllocation/1000cols 103552 ns/op → 35607 ns/op (65.6% faster)
72768 B, 2002 allocs → 40768 B, 1002 allocs
Every column in RowData() calls NewWithError(), making this optimization
highly impactful for queries with many columns. The improvement compounds
with the previous commit's pre-sizing and direct indexing changes.
Same improvement can be done (in a separate PR) to collections and tuples
(and their NewWithError() functions)
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…d tuple types
Similar to the previous commit:
Add type-specific fast paths in CollectionType.NewWithError() and
TupleTypeInfo.NewWithError() to avoid expensive reflection calls during
row data allocation.
Changes:
- CollectionType.NewWithError(): Fast paths for common patterns:
* Lists/sets: []int, []int64, []string, []bool, []float32, []float64,
[]UUID, []time.Time, []int16, []int8, [][]byte
* Maps: map[string]int, map[string]int64, map[string]string,
map[string]bool, map[string]float64, map[string]UUID,
map[int]string, map[int]int
* Falls back to reflection for complex nested collections
- TupleTypeInfo.NewWithError(): Simplified to always return
new([]interface{}) since tuples unmarshal to []interface{} regardless
of element types, completely eliminating reflection
(Note - we may want to consider moving from interface{} to any?)
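The collection fast path follows the same pattern, keyed on common list/map shapes. The sketch below is illustrative - the string-keyed switch is a simplification of matching on the driver's element `TypeInfo`, and only a few of the listed shapes are shown:

```go
package main

import "fmt"

// newCollection sketches the fast path for common list/map shapes; the
// string keys stand in for matching on the real CollectionType element info.
func newCollection(kind, elem string) (interface{}, bool) {
	switch kind + "<" + elem + ">" {
	case "list<int>":
		return new([]int), true
	case "list<text>":
		return new([]string), true
	case "map<text,int>":
		return new(map[string]int), true
	}
	return nil, false // fall back to reflection for nested collections
}

func main() {
	v, ok := newCollection("list", "text")
	_, isStrSlice := v.(*[]string)
	fmt.Println(ok, isStrSlice)
}
```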
Performance impact:
- Tuple-heavy queries: ~3% faster (1047→1017 ns/op)
- Maintains performance for primitive-heavy workloads
- Part of broader RowData() optimization series:
* Combined improvements: 58.7% faster overall
* Memory: -44% (720→400 B/op)
* Allocations: -45% (22→12 allocs/op)
Benchmarks show targeted benefits for queries using collections and
tuples while preserving fast-path performance for queries dominated by
native types.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…pace/table once

Store keyspace and table in the resultMetadata/preparedMetadata structs instead of re-reading or re-assigning them for every column.

Changes:
- Add keyspace/table fields to the resultMetadata struct (matching preparedMetadata)
- Read keyspace/table once before the column loop for both globalSpec paths:
  * globalSpec=true: read from the metadata position (wire protocol global spec)
  * !globalSpec: read from the first column position, skip the redundant reads for the remaining columns
- Add a skipString() helper to efficiently skip wire strings without allocation
- Simplify readCol(): eliminate the isFirstCol conditional, always skip for !globalSpec

Benefits:
- Eliminates N branch checks in the hot loop (one per column when globalSpec=true)
- Eliminates (N-1)×2 string allocations when !globalSpec (skip instead of read)
- Saves 16 bytes per column (2 string headers) by deduplicating storage
- Maintains API compatibility: ColumnInfo.Keyspace/Table unchanged

Wire protocol correctness:
- globalSpec=true: keyspace/table sent once at the metadata level
- !globalSpec: keyspace/table sent per column (the protocol requires it even when identical)
- Both cases: read once, store in metadata, reference in columns

NOTE: this is against the protocol in theory, but correct in practice - results are never sent from different keyspaces/tables.

Performance: ~2-5% improvement expected for wide tables (>20 columns), with minimal code complexity. Benchmarks show no regression for typical workloads.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
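The skipString() helper can be sketched as below, based on the wire format of a [short string] (2-byte big-endian length prefix followed by the bytes); the slice-based signature is illustrative, since the real framer advances an internal buffer:

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// skipString advances past a [short string] on the wire (2-byte length
// prefix plus the bytes) without allocating a Go string for its contents.
func skipString(buf []byte) []byte {
	n := int(binary.BigEndian.Uint16(buf[:2]))
	return buf[2+n:]
}

func main() {
	// two consecutive wire strings: "ks" then "tbl"
	buf := []byte{0, 2, 'k', 's', 0, 3, 't', 'b', 'l'}
	rest := skipString(buf) // skip "ks" without materializing it
	fmt.Println(len(rest), string(rest[2:]))
}
```

Skipping avoids the string header (16 bytes on 64-bit platforms) and the copy that a `readString` would pay for each redundant per-column keyspace/table pair.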
… types
Add an early-return optimization for simple native types (0x0001-0x0015)
in readTypeInfo(). These types (int, text, bigint, timestamp, UUID, etc.)
represent most columns in typical queries and need no further processing
beyond creating the NativeType struct.
Changes:
- Add fast-path check: if id > 0 && id <= 0x0015 { return simple }
- Eliminates TypeCustom branch check for native types
- Skips entire switch statement for collection/UDT/tuple types
- Simplify readCol(): convert single-case switch to type assertion
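The early-return shape can be sketched as below; `nativeType` and the reduced `readTypeInfo` signature are illustrative stand-ins for the driver's actual structs and parsing code:

```go
package main

import "fmt"

const typeCustom = 0x0000

// nativeType is a stand-in for the driver's NativeType struct.
type nativeType struct {
	proto byte
	typ   int
}

// readTypeInfo sketches the fast path: ids 0x0001..0x0015 are simple native
// types that need no further parsing, so the whole collection/UDT/tuple
// switch can be skipped for them.
func readTypeInfo(proto byte, id int) interface{} {
	if id > typeCustom && id <= 0x0015 {
		return nativeType{proto: proto, typ: id} // fast path: done
	}
	// slower path: custom, list/set/map, UDT, tuple (elided)
	return nil
}

func main() {
	t := readTypeInfo(4, 0x0009) // 0x0009 is "int" in the CQL binary protocol
	fmt.Println(t != nil)
}
```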
Benefits:
- Reduces hot-path branching for most common type IDs
- More idiomatic Go code (type assertion vs single-case switch)
- No performance regression in benchmarks (464.6 ns/op vs 465.5 ns/op baseline)
The optimization primarily benefits metadata parsing during prepared statement
execution and result frame processing, especially with DisableSkipMetadata=true
(the default).
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…h prediction

Convert all framer read* methods from an error-first to a happy-path-first pattern to improve CPU branch prediction. Modern processors predict forward branches (the common case) more efficiently than backward branches (error paths). Since there are no `err != nil` checks here that the compiler already knows to handle with branch prediction, this is a reasonable change.

Changes:
- Reorder all read* functions: check that the buffer has sufficient data first, process and return on success, panic on the error path
- Affects: readByte, readInt, readShort, readString, skipString, readLongString, ReadBytesInternal, readBytes, readBytesCopy, readShortBytes, readInetAdressOnly
- No functional changes, only reordering of conditionals

Benchmark: BenchmarkRowData shows stable performance at 463.0 ns/op (no regression from the 464.6 ns/op baseline). The effect is unlikely to be visible in most cases, but the change is also reasonable for readability.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
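The happy-path-first shape can be sketched with one representative reader; the slice-based signature is illustrative (the real framer reads from its internal buffer), while the panic-on-underflow error path mirrors the framer's style:

```go
package main

import "fmt"

// readShort sketches the happy-path-first pattern: verify the buffer has
// enough data, decode and return immediately, and keep the error path at
// the bottom of the function.
func readShort(buf []byte) (uint16, []byte) {
	if len(buf) >= 2 { // common case checked first
		return uint16(buf[0])<<8 | uint16(buf[1]), buf[2:]
	}
	panic(fmt.Errorf("not enough bytes to read short: have %d", len(buf)))
}

func main() {
	v, rest := readShort([]byte{0x01, 0x02, 0xff})
	fmt.Println(v, len(rest))
}
```

The error-first version would instead begin with `if len(buf) < 2 { panic(...) }`, putting the rare path on the fall-through branch; the reordering changes only which branch is the straight-line one.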
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
@mykaul We were bitten by this in the Rust driver before: scylladb/scylla-rust-driver#1134
OK - thanks! Not only do I need to fix it, I also believe we must add a test for this!
Fixed - and added a test.
Performance Progression: Baseline → Pre-sizing+Indexing → +Reflection Elimination
(Note - need to repeat the tests, will do later)
This is a collection of partially independent and partially dependent optimizations (I can, if needed, extract specific ones into their own PRs) focused on the receive and results-parsing path.
They are mostly straightforward, and each has its own somewhat lengthy commit message explaining the change.
On the simple benchmark that was added as well (I always ran it on my laptop with
go test -bench=BenchmarkRowData -benchmem -run=^$ -cpu=1 .), it shows a very nice performance improvement. Of course, in real life, latency and other factors may dominate, but it's a good series, I think.
Pay attention to ca6c1ea, where I cheat and violate the protocol. If approved, I think we should do the same in the other drivers as well.