(Improvement) receive path performance improvements - NOT for MERGE #699

Draft
mykaul wants to merge 11 commits into scylladb:master from mykaul:rec_perf_improvements

@mykaul mykaul commented Feb 5, 2026

This is a collection of partially independent and partially dependent optimizations focused on the receive and result-parsing path. I can extract specific ones into their own PRs if needed.

They are mostly straightforward and each has its own somewhat lengthy commit message explaining the change.

On the simple benchmark that was also added (I always ran it on my laptop with `go test -bench=BenchmarkRowData -benchmem -run=^$ -cpu=1 .`), it shows a very nice performance improvement.
Of course, in real life, latency and other factors may dominate, but I think it's a good series.

Pay attention to ca6c1ea where I cheat and violate the protocol. If approved, I think we should do the same for other drivers as well.

@mykaul mykaul changed the title (Improvement) receive path performance improvement (Improvement) receive path performance improvements Feb 5, 2026

mykaul commented Feb 5, 2026

Results:


Performance Progression: Baseline → Pre-sizing+Indexing → +Reflection Elimination

| Benchmark | Baseline | After Commit 1 | Current | Total Improvement |
|---|---|---|---|---|
| RowData | 1122 ns/op<br>720 B, 22 allocs | 1092 ns/op<br>720 B, 22 allocs | 463.3 ns/op<br>400 B, 12 allocs | 🚀 58.7% faster<br>-44% memory, -45% allocs |
| RowDataSmall | 380.7 ns/op<br>216 B, 8 allocs | 370.7 ns/op<br>216 B, 8 allocs | 180.6 ns/op<br>120 B, 5 allocs | 🚀 52.6% faster<br>-44% memory, -38% allocs |
| RowDataLarge | 5220 ns/op<br>3792 B, 102 allocs | 5074 ns/op<br>3792 B, 102 allocs | 1865 ns/op<br>2192 B, 52 allocs | 🚀 64.3% faster<br>-42% memory, -49% allocs |
| RowDataWithTypes | 1216 ns/op<br>776 B, 22 allocs | 1174 ns/op<br>776 B, 22 allocs | 533.9 ns/op<br>456 B, 12 allocs | 🚀 56.1% faster<br>-41% memory, -45% allocs |
| RowDataWithTuples | 1556 ns/op<br>616 B, 20 allocs | 1369 ns/op<br>488 B, 18 allocs | 1017 ns/op<br>328 B, 13 allocs | 🚀 34.6% faster<br>-47% memory, -35% allocs |
| RowDataRepeated | 112088 ns/op<br>72000 B, 2200 allocs | 108273 ns/op<br>72000 B, 2200 allocs | 46064 ns/op<br>40000 B, 1200 allocs | 🚀 58.9% faster<br>-44% memory, -45% allocs |
| 10cols | 1144 ns/op<br>720 B, 22 allocs | 1109 ns/op<br>720 B, 22 allocs | 470.8 ns/op<br>400 B, 12 allocs | 🚀 58.9% faster<br>-44% memory, -45% allocs |
| 100cols | 10580 ns/op<br>7584 B, 202 allocs | 10130 ns/op<br>7584 B, 202 allocs | 3624 ns/op<br>4384 B, 102 allocs | 🚀 65.7% faster<br>-42% memory, -50% allocs |
| 1000cols | 103552 ns/op<br>72768 B, 2002 allocs | 101583 ns/op<br>72768 B, 2002 allocs | 35607 ns/op<br>40768 B, 1002 allocs | 🚀 65.6% faster<br>-44% memory, -50% allocs |
| WithTuples | 1602 ns/op<br>616 B, 20 allocs | 1390 ns/op<br>488 B, 18 allocs | 1047 ns/op<br>328 B, 13 allocs | 🚀 34.6% faster<br>-47% memory, -35% allocs |

@mykaul mykaul marked this pull request as draft February 5, 2026 08:51

mykaul commented Feb 5, 2026

> Pay attention to ca6c1ea where I cheat and violate the protocol. If approved, I think we should do the same for other drivers as well.

And of course, that was not as straightforward - need to fix it for older Scylla releases. Will do.

@mykaul mykaul force-pushed the rec_perf_improvements branch from f90fadf to 500329a Compare February 5, 2026 09:42

mykaul commented Feb 5, 2026

> > Pay attention to ca6c1ea where I cheat and violate the protocol. If approved, I think we should do the same for other drivers as well.
>
> And of course, that was not as straightforward - need to fix it for older Scylla releases. Will do.

Fixed!

@mykaul mykaul marked this pull request as ready for review February 5, 2026 10:59
@mykaul mykaul requested a review from Copilot February 5, 2026 19:04

Copilot AI left a comment

Pull request overview

This pull request introduces performance optimizations for the receive and parsing path in the GoCQL driver, with a focus on reducing allocations and improving efficiency when reading query results.

Changes:

  • Added fast-path optimizations for common type allocations in NewWithError() methods, avoiding reflection overhead for frequently used types
  • Optimized RowData() to pre-allocate slices with the correct size and use direct indexing instead of append operations
  • Improved frame reading performance through single-call header reads, larger default buffer size (128→4096 bytes), early returns for simple types, and "happy path first" restructuring of buffer read functions
  • Introduced protocol-level optimization that skips redundant keyspace/table reads when FlagGlobalTableSpec is not set, assuming all columns share the same keyspace/table values

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
|---|---|
| marshal.go | Adds fast-path type allocations for NativeType, CollectionType, and TupleTypeInfo to avoid reflection overhead for common types |
| helpers_bench_test.go | New benchmark suite for measuring RowData() performance with various column counts and types |
| helpers.go | Optimizes RowData() by pre-sizing slices and using direct indexing instead of append |
| frame.go | Multiple optimizations: single-call header reading, larger buffer size, early return for simple types, happy-path-first buffer reads, skipString() helper, and keyspace/table read optimization |
| conn.go | Conditional time tracking to avoid time.Now() calls when no frameObserver is present, and refactored stream handling |

Comment thread frame.go Outdated
Comment thread frame.go Outdated
mykaul added 10 commits February 5, 2026 23:53
1. No need to call time.Now() twice if there's no frameObserver configured.
By default, there isn't one configured.

2. streamline the 'if' checks for stream - if it's <= 0, build a frame.
If it's -1, it's an event message. At least 1 less if in the common case.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
1. We can read the whole header at once - 9 bytes.
No point in reading just 1 byte, then more - 1 less call to io.ReadFull()

2. Parsing and checking version field can be done once.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
128 is really really small. Most results are likely larger.
4K is reasonable (and happens to be the same size as in the Python driver)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Use pre-sizing and direct indexing in RowData() to improve
performance and reduce allocations, especially for queries with tuples.

Changes:
- Pre-size slices using iter.meta.actualColCount instead of len(iter.Columns())
  to account for tuple expansion, eliminating reallocation
- Replace append operations with direct slice indexing to avoid append's
  capacity check and length update overhead

Performance improvements (measured on Intel i7-1270P, single core):
- Regular columns: 2-4% faster across all column counts
- Tuple columns: 12-13% faster with 128 bytes less memory and 2 fewer
  allocations per RowData() call

Benchmark results:
                              Baseline        Optimized       Improvement
BenchmarkRowData              1122 ns/op      1092 ns/op      2.7% faster
BenchmarkRowDataWithTuples    1556 ns/op      1369 ns/op      12.0% faster
                              616 B, 20 allocs 488 B, 18 allocs
BenchmarkRowDataAllocation/100cols   10580 ns/op     10130 ns/op     4.3% faster
BenchmarkRowDataAllocation/WithTuples 1602 ns/op     1390 ns/op      13.2% faster
                              616 B, 20 allocs 488 B, 18 allocs

Added comprehensive benchmark suite in helpers_bench_test.go to measure
RowData() performance across various column counts and data types.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
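
The pre-sizing idea can be sketched as follows. The `column` type, `actualColCount`, and `rowNames` are simplified, hypothetical stand-ins for the driver's structures; the example only shows sizing the slice for tuple expansion up front and then filling it by index.

```go
package main

import "fmt"

// column is a hypothetical, simplified stand-in for the driver's column info.
type column struct {
	name       string
	tupleElems int // 0 for a non-tuple column
}

// actualColCount counts one slot per regular column and one per tuple element.
func actualColCount(cols []column) int {
	n := 0
	for _, c := range cols {
		if c.tupleElems == 0 {
			n++
		} else {
			n += c.tupleElems
		}
	}
	return n
}

// rowNames allocates the result once at its final size and fills it by
// index, so the slice never grows while the loop runs.
func rowNames(cols []column) []string {
	names := make([]string, actualColCount(cols))
	i := 0
	for _, c := range cols {
		if c.tupleElems == 0 {
			names[i] = c.name
			i++
			continue
		}
		for e := 0; e < c.tupleElems; e++ {
			names[i] = fmt.Sprintf("%s[%d]", c.name, e)
			i++
		}
	}
	return names
}

func main() {
	cols := []column{{name: "id"}, {name: "pos", tupleElems: 2}}
	fmt.Println(rowNames(cols))
}
```

Sizing by `len(cols)` alone would be too small whenever a tuple column is present, which is where the reallocation in the append-based version came from.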
Add fast-path type instantiation for all native CQL types to avoid
expensive reflection calls when creating column value holders.

Changes:
- Added type switch in NativeType.NewWithError() with direct instantiation
  for all 17 native types (int, bigint, text, uuid, timestamp, etc.)
- Falls back to reflection only for TypeCustom and complex types
- Added required imports: time, math/big, gopkg.in/inf.v0

Performance improvements (measured on Intel i7-1270P, single core):
Combined with previous pre-sizing/indexing optimization, achieves:
- 35-66% faster RowData() across all workloads
- 40-47% less memory allocation
- 35-50% fewer allocations

The reflection elimination provides 50-65% speedup on top of the
pre-sizing optimization, with improvement scaling with column count.

Benchmark comparison (baseline → optimized):
BenchmarkRowData              1122 ns/op → 463 ns/op   (58.7% faster)
                              720 B, 22 allocs → 400 B, 12 allocs
BenchmarkRowDataLarge         5220 ns/op → 1865 ns/op  (64.3% faster)
                              3792 B, 102 allocs → 2192 B, 52 allocs
BenchmarkRowDataAllocation/100cols   10580 ns/op → 3624 ns/op (65.7% faster)
                              7584 B, 202 allocs → 4384 B, 102 allocs
BenchmarkRowDataAllocation/1000cols  103552 ns/op → 35607 ns/op (65.6% faster)
                              72768 B, 2002 allocs → 40768 B, 1002 allocs

Every column in RowData() calls NewWithError(), making this optimization
highly impactful for queries with many columns. The improvement compounds
with the previous commit's pre-sizing and direct indexing changes.

Same improvement can be done (in a separate PR) to collections and tuples
(and their NewWithError() functions)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
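
A minimal sketch of the fast-path pattern described above, using a made-up `kind` enum with only a handful of types rather than the driver's real type constants. Known kinds are allocated directly with `new`, and everything else falls back to reflection.

```go
package main

import (
	"fmt"
	"reflect"
)

// kind is a hypothetical, reduced type enum used only for this example.
type kind int

const (
	kindInt kind = iota
	kindBigInt
	kindText
	kindBoolean
	kindCustom
)

// newValue returns a pointer to a zero value suitable for scanning into.
// Common kinds are handled by a type switch; only the rest use reflection.
func newValue(k kind) interface{} {
	switch k {
	case kindInt:
		return new(int)
	case kindBigInt:
		return new(int64)
	case kindText:
		return new(string)
	case kindBoolean:
		return new(bool)
	}
	// Slow path: in this sketch, unknown kinds are treated as raw bytes.
	return reflect.New(reflect.TypeOf([]byte(nil))).Interface()
}

func main() {
	fmt.Printf("%T\n", newValue(kindInt))
	fmt.Printf("%T\n", newValue(kindText))
	fmt.Printf("%T\n", newValue(kindCustom))
}
```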
…d tuple types

Similar to the previous commit:
Add type-specific fast paths in CollectionType.NewWithError() and
TupleTypeInfo.NewWithError() to avoid expensive reflection calls during
row data allocation.

Changes:
- CollectionType.NewWithError(): Fast paths for common patterns:
  * Lists/sets: []int, []int64, []string, []bool, []float32, []float64,
    []UUID, []time.Time, []int16, []int8, [][]byte
  * Maps: map[string]int, map[string]int64, map[string]string,
    map[string]bool, map[string]float64, map[string]UUID,
    map[int]string, map[int]int
  * Falls back to reflection for complex nested collections

- TupleTypeInfo.NewWithError(): Simplified to always return
  new([]interface{}) since tuples unmarshal to []interface{} regardless
  of element types, completely eliminating reflection

(Note - we may want to consider moving from interface{} to any?)

Performance impact:
- Tuple-heavy queries: ~3% faster (1047→1017 ns/op)
- Maintains performance for primitive-heavy workloads
- Part of broader RowData() optimization series:
  * Combined improvements: 58.7% faster overall
  * Memory: -44% (720→400 B/op)
  * Allocations: -45% (22→12 allocs/op)

Benchmarks show targeted benefits for queries using collections and
tuples while preserving fast-path performance for queries dominated by
native types.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
…pace/table once

Store keyspace and table in resultMetadata/preparedMetadata structs
instead of re-reading or re-assigning for every column.

Changes:
- Add keyspace/table fields to resultMetadata struct (matching preparedMetadata)
- Read keyspace/table once before column loop for both globalSpec paths:
  * globalSpec=true: read from metadata position (wire protocol global spec)
  * !globalSpec: read from first column position, skip redundant reads for remaining columns
- Add skipString() helper to efficiently skip wire strings without allocation
- Simplify readCol(): eliminate isFirstCol conditional, always skip for !globalSpec

Benefits:
- Eliminates N branch checks in hot loop (one per column when globalSpec=true)
- Eliminates (N-1)×2 string allocations when !globalSpec (skip instead of read)
- Saves 16 bytes per column (2 string headers) by deduplicating storage
- Maintains API compatibility: ColumnInfo.Keyspace/Table unchanged

Wire protocol correctness:
- globalSpec=true: keyspace/table sent once at metadata level
- !globalSpec: keyspace/table sent per-column (protocol requires it even when identical)
- Both cases: read once, store in metadata, reference in columns.
NOTE: this is against the protocol in theory, and correct in practice. No results are ever
sent from different keyspace/table.

Performance: ~2-5% improvement expected for wide tables (>20 columns) with
minimal code complexity. Benchmarks show no regression for typical workloads.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
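
The `skipString()` idea can be sketched roughly like this, assuming the wire format of a CQL `[string]` (a 2-byte big-endian length followed by that many bytes). The `reader` type and the `appendString` helper are hypothetical and exist only to make the example self-contained.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// reader is a hypothetical minimal buffer cursor.
type reader struct{ buf []byte }

// strLen reads the 2-byte length prefix and checks that the payload fits.
func (r *reader) strLen() int {
	if len(r.buf) < 2 {
		panic("short buffer: missing string length")
	}
	n := int(binary.BigEndian.Uint16(r.buf))
	if len(r.buf) < 2+n {
		panic("short buffer: truncated string")
	}
	return n
}

// readString decodes the string, which allocates a copy of its bytes.
func (r *reader) readString() string {
	n := r.strLen()
	s := string(r.buf[2 : 2+n])
	r.buf = r.buf[2+n:]
	return s
}

// skipString advances past the string without allocating anything.
func (r *reader) skipString() {
	n := r.strLen()
	r.buf = r.buf[2+n:]
}

// appendString encodes s in the same length-prefixed format.
func appendString(b []byte, s string) []byte {
	b = append(b, byte(len(s)>>8), byte(len(s)))
	return append(b, s...)
}

func main() {
	var b []byte
	for _, s := range []string{"ks", "tbl", "col"} {
		b = appendString(b, s)
	}
	r := &reader{buf: b}
	r.skipString() // keyspace, already known
	r.skipString() // table, already known
	fmt.Println(r.readString())
}
```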
… types

Add early-return optimization for simple native types (0x0001-0x0015)
in readTypeInfo(). These types (int, text, bigint, timestamp, UUID, etc.)
represent most columns in typical queries and need no further processing
beyond creating the NativeType struct.

Changes:
- Add fast-path check: if id > 0 && id <= 0x0015 { return simple }
- Eliminates TypeCustom branch check for native types
- Skips entire switch statement for collection/UDT/tuple types
- Simplify readCol(): convert single-case switch to type assertion

Benefits:
- Reduces hot-path branching for most common type IDs
- More idiomatic Go code (type assertion vs single-case switch)
- No performance regression in benchmarks (464.6 ns/op vs 465.5 ns/op baseline)

The optimization primarily benefits metadata parsing during prepared statement
execution and result frame processing, especially with DisableSkipMetadata=true
(the default).

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
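
The early-return check from the commit message above can be illustrated with a small sketch. The `typeID` constants follow the CQL type IDs as far as they are used here, while `classify` and its string results are invented for the example and do not mirror `readTypeInfo()`.

```go
package main

import "fmt"

type typeID uint16

const (
	typeCustom typeID = 0x0000
	typeInt    typeID = 0x0009
	typeList   typeID = 0x0020
	typeTuple  typeID = 0x0031

	// lastSimple is the upper bound used by the fast-path range check.
	lastSimple typeID = 0x0015
)

// classify returns early for simple native IDs, so the switch below is
// reached only for custom, collection, and tuple types.
func classify(id typeID) string {
	if id > 0 && id <= lastSimple {
		return "native"
	}
	switch id {
	case typeCustom:
		return "custom"
	case typeList:
		return "collection"
	case typeTuple:
		return "tuple"
	}
	return "unknown"
}

func main() {
	for _, id := range []typeID{typeInt, typeCustom, typeList, typeTuple} {
		fmt.Printf("0x%04x -> %s\n", uint16(id), classify(id))
	}
}
```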
…h prediction

Convert all framer read* methods from error-first to happy-path-first
pattern to improve CPU branch prediction. Modern processors predict
forward branches (common case) more efficiently than backward branches
(error paths).

Since there are no `err != nil` checks here, which the compiler knows how to
lay out well for branch prediction, this is a reasonable change.

Changes:
- Reorder all read* functions: check buffer has sufficient data first,
  process and return on success, panic on error path
- Affects: readByte, readInt, readShort, readString, skipString,
  readLongString, ReadBytesInternal, readBytes, readBytesCopy,
  readShortBytes, readInetAdressOnly
- No functional changes, only reordering of conditionals

Benchmark: BenchmarkRowData shows stable performance at 463.0 ns/op
(no regression from 464.6 ns/op baseline).

This is unlikely to be seen in most cases, but also reasonable for readability.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
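
A small sketch of the happy-path-first ordering, using a hypothetical `framer` with only a byte slice. It is meant to show the shape of the reordering (success branch first, panic at the end), not the driver's actual `readInt` body.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// framer is a hypothetical minimal buffer holder.
type framer struct{ buf []byte }

// readInt checks for the common case first, returns on success, and
// leaves the panic as the fall-through error path.
func (f *framer) readInt() int32 {
	if len(f.buf) >= 4 {
		v := int32(binary.BigEndian.Uint32(f.buf))
		f.buf = f.buf[4:]
		return v
	}
	panic(fmt.Errorf("not enough bytes to read int: need 4, have %d", len(f.buf)))
}

func main() {
	f := &framer{buf: []byte{0x00, 0x00, 0x01, 0x00}}
	fmt.Println(f.readInt())
}
```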
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
@mykaul mykaul force-pushed the rec_perf_improvements branch from 500329a to 5ee3be2 Compare February 6, 2026 09:11
Lorak-mmk commented

@mykaul
Regarding protocol violation: where exactly is your optimization used?
It is fine to do such optimization for result metadata (present in ROWS and PREPARED) responses.
It is however NOT fine to do this for prepared metadata (metadata about prepared statement bound variables).
Such metadata can actually contain different table specs for different vars, when a user prepares a textual batch statement that contains statements operating on different tables.

We were bitten by this in Rust Driver before: scylladb/scylla-rust-driver#1134
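
To make the concern above concrete, here is a hedged sketch of what goes wrong if the first column's table spec is stamped onto every bound variable. The `colSpec` type, `firstSpecOnly`, and `uniformSpec` are hypothetical names, and the two-table input merely imitates a batch touching different tables.

```go
package main

import "fmt"

// colSpec is a hypothetical per-column spec as it would appear on the wire.
type colSpec struct{ keyspace, table, name string }

// firstSpecOnly mimics the shortcut: keep each column name, but reuse the
// first column's keyspace and table for every column.
func firstSpecOnly(wire []colSpec) []colSpec {
	out := make([]colSpec, len(wire))
	for i, c := range wire {
		out[i] = colSpec{wire[0].keyspace, wire[0].table, c.name}
	}
	return out
}

// uniformSpec reports whether all columns share one keyspace and table,
// which is the condition under which the shortcut is lossless.
func uniformSpec(wire []colSpec) bool {
	if len(wire) == 0 {
		return true
	}
	for _, c := range wire[1:] {
		if c.keyspace != wire[0].keyspace || c.table != wire[0].table {
			return false
		}
	}
	return true
}

func main() {
	// Bound variables of a batch that writes to two different tables.
	vars := []colSpec{
		{"ks", "users", "name"},
		{"ks", "orders", "total"},
	}
	fmt.Println("uniform:", uniformSpec(vars))
	fmt.Println("shortcut result:", firstSpecOnly(vars))
}
```

In this sketch the shortcut reports `total` as belonging to `users`, which is the kind of mislabeling the comment warns about for prepared (bound variable) metadata.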

@mykaul mykaul marked this pull request as draft February 6, 2026 12:21

mykaul commented Feb 9, 2026

> @mykaul Regarding protocol violation: where exactly is your optimization used? It is fine to do such optimization for result metadata (present in ROWS and PREPARED) responses. It is however NOT fine to do this for prepared metadata (metadata about prepared statement bound variables). Such metadata can actually contain different table specs for different vars, when a user prepares a textual batch statement that contains statements operating on different tables.
>
> We were bitten by this in Rust Driver before: scylladb/scylla-rust-driver#1134

OK - thanks! Not only I need to fix it, but I also believe we must add a test for this!

mykaul commented Feb 10, 2026

> > @mykaul Regarding protocol violation: where exactly is your optimization used? It is fine to do such optimization for result metadata (present in ROWS and PREPARED) responses. It is however NOT fine to do this for prepared metadata (metadata about prepared statement bound variables). Such metadata can actually contain different table specs for different vars, when a user prepares a textual batch statement that contains statements operating on different tables.
> > We were bitten by this in Rust Driver before: scylladb/scylla-rust-driver#1134
>
> OK - thanks! Not only I need to fix it, but I also believe we must add a test for this!

Fixed - and added a test.

mykaul commented Feb 13, 2026

Performance Progression: Baseline → Pre-sizing+Indexing → +Reflection Elimination

| Benchmark | Baseline | After Commit 1 | Current | Total Improvement |
|---|---|---|---|---|
| RowData | 1122 ns/op<br>720 B, 22 allocs | 1092 ns/op<br>720 B, 22 allocs | 463.3 ns/op<br>400 B, 12 allocs | 🚀 58.7% faster<br>-44% memory, -45% allocs |
| RowDataSmall | 380.7 ns/op<br>216 B, 8 allocs | 370.7 ns/op<br>216 B, 8 allocs | 180.6 ns/op<br>120 B, 5 allocs | 🚀 52.6% faster<br>-44% memory, -38% allocs |
| RowDataLarge | 5220 ns/op<br>3792 B, 102 allocs | 5074 ns/op<br>3792 B, 102 allocs | 1865 ns/op<br>2192 B, 52 allocs | 🚀 64.3% faster<br>-42% memory, -49% allocs |
| RowDataWithTypes | 1216 ns/op<br>776 B, 22 allocs | 1174 ns/op<br>776 B, 22 allocs | 533.9 ns/op<br>456 B, 12 allocs | 🚀 56.1% faster<br>-41% memory, -45% allocs |
| RowDataWithTuples | 1556 ns/op<br>616 B, 20 allocs | 1369 ns/op<br>488 B, 18 allocs | 1017 ns/op<br>328 B, 13 allocs | 🚀 34.6% faster<br>-47% memory, -35% allocs |
| RowDataRepeated | 112088 ns/op<br>72000 B, 2200 allocs | 108273 ns/op<br>72000 B, 2200 allocs | 46064 ns/op<br>40000 B, 1200 allocs | 🚀 58.9% faster<br>-44% memory, -45% allocs |
| 10cols | 1144 ns/op<br>720 B, 22 allocs | 1109 ns/op<br>720 B, 22 allocs | 470.8 ns/op<br>400 B, 12 allocs | 🚀 58.9% faster<br>-44% memory, -45% allocs |
| 100cols | 10580 ns/op<br>7584 B, 202 allocs | 10130 ns/op<br>7584 B, 202 allocs | 3624 ns/op<br>4384 B, 102 allocs | 🚀 65.7% faster<br>-42% memory, -50% allocs |
| 1000cols | 103552 ns/op<br>72768 B, 2002 allocs | 101583 ns/op<br>72768 B, 2002 allocs | 35607 ns/op<br>40768 B, 1002 allocs | 🚀 65.6% faster<br>-44% memory, -50% allocs |
| WithTuples | 1602 ns/op<br>616 B, 20 allocs | 1390 ns/op<br>488 B, 18 allocs | 1047 ns/op<br>328 B, 13 allocs | 🚀 34.6% faster<br>-47% memory, -35% allocs |

mykaul commented Feb 13, 2026

(Note - need to repeat the tests, will do later)

mykaul commented Feb 14, 2026

Running `go test -run=^$ -bench=Wiki -benchmem -tags integration -count=1 -cpu=1 -cluster=127.0.2.1 .` did not convince me there's a winner here. I wonder if my benchmark isn't good, or if I'm not testing well (most likely: my Scylla nodes may be running wherever, and it's all on my laptop anyway).

@mykaul mykaul changed the title (Improvement) receive path performance improvements (Improvement) receive path performance improvements - NOT for MERGE Mar 16, 2026
3 participants