(Improvement) receive path performance improvements - NOT for MERGE by mykaul · Pull Request #699 · scylladb/gocql

mykaul · 2026-02-05T08:23:11Z

This is a collection of partially independent and partially dependent optimizations focused on the receive and result-parsing path. If needed, I can extract specific ones into their own PRs.

They are mostly straightforward and each has its own somewhat lengthy commit message explaining the change.

On the simple benchmark that was added as well (I always ran it on my laptop with go test -bench=BenchmarkRowData -benchmem -run=^$ -cpu=1 . ), it shows a very nice performance improvement.
Of course, in real life, latency and other stuff may dominate, but it's a good series, I think.

Pay attention to ca6c1ea where I cheat and violate the protocol. If approved, I think we should do the same for other drivers as well.

mykaul · 2026-02-05T08:25:39Z

Results:


Performance Progression: Baseline → Pre-sizing+Indexing → +Reflection Elimination
Benchmark	Baseline	After Commit 1	Current	Total Improvement
RowData	1122 ns/op, 720 B, 22 allocs	1092 ns/op, 720 B, 22 allocs	463.3 ns/op, 400 B, 12 allocs	🚀 58.7% faster, -44% memory, -45% allocs
RowDataSmall	380.7 ns/op, 216 B, 8 allocs	370.7 ns/op, 216 B, 8 allocs	180.6 ns/op, 120 B, 5 allocs	🚀 52.6% faster, -44% memory, -38% allocs
RowDataLarge	5220 ns/op, 3792 B, 102 allocs	5074 ns/op, 3792 B, 102 allocs	1865 ns/op, 2192 B, 52 allocs	🚀 64.3% faster, -42% memory, -49% allocs
RowDataWithTypes	1216 ns/op, 776 B, 22 allocs	1174 ns/op, 776 B, 22 allocs	533.9 ns/op, 456 B, 12 allocs	🚀 56.1% faster, -41% memory, -45% allocs
RowDataWithTuples	1556 ns/op, 616 B, 20 allocs	1369 ns/op, 488 B, 18 allocs	1017 ns/op, 328 B, 13 allocs	🚀 34.6% faster, -47% memory, -35% allocs
RowDataRepeated	112088 ns/op, 72000 B, 2200 allocs	108273 ns/op, 72000 B, 2200 allocs	46064 ns/op, 40000 B, 1200 allocs	🚀 58.9% faster, -44% memory, -45% allocs
10cols	1144 ns/op, 720 B, 22 allocs	1109 ns/op, 720 B, 22 allocs	470.8 ns/op, 400 B, 12 allocs	🚀 58.9% faster, -44% memory, -45% allocs
100cols	10580 ns/op, 7584 B, 202 allocs	10130 ns/op, 7584 B, 202 allocs	3624 ns/op, 4384 B, 102 allocs	🚀 65.7% faster, -42% memory, -50% allocs
1000cols	103552 ns/op, 72768 B, 2002 allocs	101583 ns/op, 72768 B, 2002 allocs	35607 ns/op, 40768 B, 1002 allocs	🚀 65.6% faster, -44% memory, -50% allocs
WithTuples	1602 ns/op, 616 B, 20 allocs	1390 ns/op, 488 B, 18 allocs	1047 ns/op, 328 B, 13 allocs	🚀 34.6% faster, -47% memory, -35% allocs

mykaul · 2026-02-05T08:52:09Z

> Pay attention to ca6c1ea where I cheat and violate the protocol. If approved, I think we should do the same for other drivers as well.

And of course, that was not as straightforward - need to fix it for older Scylla releases. Will do.

mykaul · 2026-02-05T10:59:13Z

> Pay attention to ca6c1ea where I cheat and violate the protocol. If approved, I think we should do the same for other drivers as well.
>
> And of course, that was not as straightforward - need to fix it for older Scylla releases. Will do.

Fixed!

Copilot

Pull request overview

This pull request introduces performance optimizations for the receive and parsing path in the GoCQL driver, with a focus on reducing allocations and improving efficiency when reading query results.

Changes:

Added fast-path optimizations for common type allocations in NewWithError() methods, avoiding reflection overhead for frequently used types
Optimized RowData() to pre-allocate slices with the correct size and use direct indexing instead of append operations
Improved frame reading performance through single-call header reads, larger default buffer size (128→4096 bytes), early returns for simple types, and "happy path first" restructuring of buffer read functions
Introduced protocol-level optimization that skips redundant keyspace/table reads when FlagGlobalTableSpec is not set, assuming all columns share the same keyspace/table values

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

File	Description
marshal.go	Adds fast-path type allocations for NativeType, CollectionType, and TupleTypeInfo to avoid reflection overhead for common types
helpers_bench_test.go	New benchmark suite for measuring RowData() performance with various column counts and types
helpers.go	Optimizes RowData() by pre-sizing slices and using direct indexing instead of append
frame.go	Multiple optimizations: single-call header reading, larger buffer size, early return for simple types, happy-path-first buffer reads, skipString() helper, and keyspace/table read optimization
conn.go	Conditional time tracking to avoid time.Now() calls when no frameObserver is present, and refactored stream handling

1. No need to call time.Now() twice if there's no frameObserver configured. By default, there isn't one configured.
2. Streamline the 'if' checks for stream - if it's <= 0, build a frame. If it's -1, it's an event message. At least 1 less if in the common case.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
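
The first point can be sketched roughly as follows. This is only an illustration of the idea, not the driver's code: `observer`, `conn`, `recvFrame`, and `countingObserver` are hypothetical names, and the real frameObserver interface is assumed to look different.

```go
package main

import (
	"fmt"
	"time"
)

// observer is a hypothetical stand-in for a frame observer callback.
type observer func(start, end time.Time)

// conn is a hypothetical connection; obs stays nil unless the user sets one.
type conn struct {
	obs observer
}

// recvFrame runs work (standing in for reading one frame) and reads the
// clock only when an observer will actually consume the timestamps.
func (c *conn) recvFrame(work func()) {
	if c.obs == nil {
		work() // common case: no time.Now() calls at all
		return
	}
	start := time.Now()
	work()
	c.obs(start, time.Now())
}

// countingObserver returns an observer that counts how often it is invoked.
func countingObserver(n *int) observer {
	return func(start, end time.Time) { *n++ }
}

func main() {
	calls := 0
	c := &conn{}
	c.recvFrame(func() {}) // no observer configured
	c.obs = countingObserver(&calls)
	c.recvFrame(func() {}) // observer configured
	fmt.Println("observer calls:", calls)
}
```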

1. We can read the whole header at once - 9 bytes. No point in reading just 1 byte, then more - 1 less call to io.ReadFull()
2. Parsing and checking version field can be done once.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
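
A minimal sketch of a single-call header read, assuming the 9-byte header layout of native protocol v3 and later (version, flags, 2-byte stream, opcode, 4-byte length). The `frameHeader`, `decodeHeader`, and `readHeader` names here are illustrative and not the driver's actual definitions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

const headerSize = 9 // assumed: protocol v3+ frame header size

// frameHeader is a hypothetical decoded header.
type frameHeader struct {
	version byte
	flags   byte
	stream  int16
	op      byte
	length  int32
}

// decodeHeader extracts all fields from an already-read header buffer.
func decodeHeader(buf [headerSize]byte) frameHeader {
	return frameHeader{
		version: buf[0] & 0x7F, // top bit marks a response frame
		flags:   buf[1],
		stream:  int16(binary.BigEndian.Uint16(buf[2:4])),
		op:      buf[4],
		length:  int32(binary.BigEndian.Uint32(buf[5:9])),
	}
}

// readHeader fetches the whole header with one io.ReadFull call instead of
// reading the version byte first and the remainder afterwards.
func readHeader(r io.Reader) (frameHeader, error) {
	var buf [headerSize]byte
	if _, err := io.ReadFull(r, buf[:]); err != nil {
		return frameHeader{}, err
	}
	return decodeHeader(buf), nil
}

func main() {
	raw := []byte{0x84, 0x00, 0x00, 0x2A, 0x08, 0x00, 0x00, 0x01, 0x00}
	h, err := readHeader(bytes.NewReader(raw))
	fmt.Println(h.version, h.stream, h.op, h.length, err)
}
```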

128 is really really small. Most results are likely larger. 4K is reasonable (and happens to be the same size as in the Python driver) Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Use pre-sizing and direct indexing in RowData() to improve performance and reduce allocations, especially for queries with tuples.

Changes:
- Pre-size slices using iter.meta.actualColCount instead of len(iter.Columns()) to account for tuple expansion, eliminating reallocation
- Replace append operations with direct slice indexing to avoid bounds checking and length update overhead

Performance improvements (measured on Intel i7-1270P, single core):
- Regular columns: 2-4% faster across all column counts
- Tuple columns: 12-13% faster with 128 bytes less memory and 2 fewer allocations per RowData() call

Benchmark results:
Benchmark	Baseline	Optimized	Improvement
BenchmarkRowData	1122 ns/op	1092 ns/op	2.7% faster
BenchmarkRowDataWithTuples	1556 ns/op, 616 B, 20 allocs	1369 ns/op, 488 B, 18 allocs	12.0% faster
BenchmarkRowDataAllocation/100cols	10580 ns/op	10130 ns/op	4.3% faster
BenchmarkRowDataAllocation/WithTuples	1602 ns/op, 616 B, 20 allocs	1390 ns/op, 488 B, 18 allocs	13.2% faster

Added comprehensive benchmark suite in helpers_bench_test.go to measure RowData() performance across various column counts and data types.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
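
The difference between the two allocation strategies can be sketched like this. The `column` type and both helper functions are hypothetical; the sketch only assumes that a tuple column expands into several value holders, so sizing by column count under-allocates.

```go
package main

import "fmt"

// column is a hypothetical column description.
type column struct {
	name  string
	width int // 1 for a plain column, N for a tuple with N elements
}

// totalWidth plays the role of a precomputed "actual column count".
func totalWidth(cols []column) int {
	n := 0
	for _, c := range cols {
		n += c.width
	}
	return n
}

// holdersAppend sizes the slice by column count and appends, so a tuple
// column pushes the slice past its capacity and forces a reallocation.
func holdersAppend(cols []column) []interface{} {
	out := make([]interface{}, 0, len(cols))
	for _, c := range cols {
		for i := 0; i < c.width; i++ {
			out = append(out, new(interface{}))
		}
	}
	return out
}

// holdersPreSized allocates the final length once and fills by index.
func holdersPreSized(cols []column) []interface{} {
	out := make([]interface{}, totalWidth(cols))
	idx := 0
	for _, c := range cols {
		for i := 0; i < c.width; i++ {
			out[idx] = new(interface{})
			idx++
		}
	}
	return out
}

func main() {
	cols := []column{{"id", 1}, {"pos", 3}}
	a, b := holdersAppend(cols), holdersPreSized(cols)
	fmt.Println(len(a), cap(a), len(b), cap(b))
}
```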

Add fast-path type instantiation for all native CQL types to avoid expensive reflection calls when creating column value holders.

Changes:
- Added type switch in NativeType.NewWithError() with direct instantiation for all 17 native types (int, bigint, text, uuid, timestamp, etc.)
- Falls back to reflection only for TypeCustom and complex types
- Added required imports: time, math/big, gopkg.in/inf.v0

Performance improvements (measured on Intel i7-1270P, single core):
Combined with previous pre-sizing/indexing optimization, achieves:
- 35-66% faster RowData() across all workloads
- 40-47% less memory allocation
- 35-50% fewer allocations

The reflection elimination provides 50-65% speedup on top of the pre-sizing optimization, with improvement scaling with column count.

Benchmark comparison (baseline → optimized):
BenchmarkRowData	1122 ns/op → 463 ns/op (58.7% faster)	720 B, 22 allocs → 400 B, 12 allocs
BenchmarkRowDataLarge	5220 ns/op → 1865 ns/op (64.3% faster)	3792 B, 102 allocs → 2192 B, 52 allocs
BenchmarkRowDataAllocation/100cols	10580 ns/op → 3624 ns/op (65.7% faster)	7584 B, 202 allocs → 4384 B, 102 allocs
BenchmarkRowDataAllocation/1000cols	103552 ns/op → 35607 ns/op (65.6% faster)	72768 B, 2002 allocs → 40768 B, 1002 allocs

Every column in RowData() calls NewWithError(), making this optimization highly impactful for queries with many columns. The improvement compounds with the previous commit's pre-sizing and direct indexing changes.

Same improvement can be done (in a separate PR) to collections and tuples (and their NewWithError() functions)

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
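
A rough sketch of the fast-path idea: a type switch that allocates common holder types directly and leaves reflection for everything else. The `kind` enum, `goType`, and `newHolder` are hypothetical stand-ins, not the driver's NativeType API.

```go
package main

import (
	"fmt"
	"reflect"
)

// kind is a hypothetical stand-in for a CQL type identifier.
type kind int

const (
	kindInt kind = iota
	kindBigInt
	kindText
	kindBoolean
	kindBlob // deliberately left on the slow path in this sketch
)

// goType maps a kind to the Go type that holds its values.
// It is only consulted on the reflection fallback.
func goType(k kind) reflect.Type {
	switch k {
	case kindBlob:
		return reflect.TypeOf([]byte(nil))
	default:
		return reflect.TypeOf("")
	}
}

// newHolder returns a pointer to a zero value for the given kind.
// Common kinds are allocated directly; the rest go through reflect.New.
func newHolder(k kind) interface{} {
	switch k {
	case kindInt:
		return new(int)
	case kindBigInt:
		return new(int64)
	case kindText:
		return new(string)
	case kindBoolean:
		return new(bool)
	default:
		return reflect.New(goType(k)).Interface()
	}
}

func main() {
	fmt.Printf("%T %T %T\n", newHolder(kindInt), newHolder(kindText), newHolder(kindBlob))
}
```

Both paths return the same pointer types, so callers should not be able to tell which one produced a holder.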

…d tuple types

Similar to the previous commit: add type-specific fast paths in CollectionType.NewWithError() and TupleTypeInfo.NewWithError() to avoid expensive reflection calls during row data allocation.

Changes:
- CollectionType.NewWithError(): Fast paths for common patterns:
  * Lists/sets: []int, []int64, []string, []bool, []float32, []float64, []UUID, []time.Time, []int16, []int8, [][]byte
  * Maps: map[string]int, map[string]int64, map[string]string, map[string]bool, map[string]float64, map[string]UUID, map[int]string, map[int]int
  * Falls back to reflection for complex nested collections
- TupleTypeInfo.NewWithError(): Simplified to always return new([]interface{}) since tuples unmarshal to []interface{} regardless of element types, completely eliminating reflection (Note - we may need to think of moving from interface to any?)

Performance impact:
- Tuple-heavy queries: ~3% faster (1047→1017 ns/op)
- Maintains performance for primitive-heavy workloads
- Part of broader RowData() optimization series:
  * Combined improvements: 58.7% faster overall
  * Memory: -44% (720→400 B/op)
  * Allocations: -45% (22→12 allocs/op)

Benchmarks show targeted benefits for queries using collections and tuples while preserving fast-path performance for queries dominated by native types.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

…pace/table once

Store keyspace and table in resultMetadata/preparedMetadata structs instead of re-reading or re-assigning for every column.

Changes:
- Add keyspace/table fields to resultMetadata struct (matching preparedMetadata)
- Read keyspace/table once before column loop for both globalSpec paths:
  * globalSpec=true: read from metadata position (wire protocol global spec)
  * !globalSpec: read from first column position, skip redundant reads for remaining columns
- Add skipString() helper to efficiently skip wire strings without allocation
- Simplify readCol(): eliminate isFirstCol conditional, always skip for !globalSpec

Benefits:
- Eliminates N branch checks in hot loop (one per column when globalSpec=true)
- Eliminates (N-1)×2 string allocations when !globalSpec (skip instead of read)
- Saves 16 bytes per column (2 string headers) by deduplicating storage
- Maintains API compatibility: ColumnInfo.Keyspace/Table unchanged

Wire protocol correctness:
- globalSpec=true: keyspace/table sent once at metadata level
- !globalSpec: keyspace/table sent per-column (protocol requires it even when identical)
- Both cases: read once, store in metadata, reference in columns.

NOTE: this is against the protocol in theory, and correct in practice. No results are ever sent from different keyspace/table.

Performance: ~2-5% improvement expected for wide tables (>20 columns) with minimal code complexity. Benchmarks show no regression for typical workloads.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
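
To illustrate the skip-versus-read distinction, here is a small sketch. It assumes the wire encoding of a [string] is a 2-byte big-endian length followed by that many bytes. The `reader` type and `encodeString` helper are hypothetical, and unlike the driver's framer this sketch returns errors instead of panicking.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

var errShort = errors.New("buffer too short")

// reader is a hypothetical cursor over a frame body.
type reader struct{ buf []byte }

// stringLen checks that a complete [string] is available and returns
// the length of its body.
func (r *reader) stringLen() (int, error) {
	if len(r.buf) < 2 {
		return 0, errShort
	}
	n := int(binary.BigEndian.Uint16(r.buf))
	if len(r.buf) < 2+n {
		return 0, errShort
	}
	return n, nil
}

// readString copies the body into a new Go string and advances.
func (r *reader) readString() (string, error) {
	n, err := r.stringLen()
	if err != nil {
		return "", err
	}
	s := string(r.buf[2 : 2+n]) // allocates and copies n bytes
	r.buf = r.buf[2+n:]
	return s, nil
}

// skipString only advances the cursor; nothing is allocated.
func (r *reader) skipString() error {
	n, err := r.stringLen()
	if err != nil {
		return err
	}
	r.buf = r.buf[2+n:]
	return nil
}

// encodeString builds the wire form of s (test helper for this sketch).
func encodeString(s string) []byte {
	out := make([]byte, 2, 2+len(s))
	binary.BigEndian.PutUint16(out, uint16(len(s)))
	return append(out, s...)
}

func main() {
	buf := append(encodeString("ks"), encodeString("tbl")...)
	buf = append(buf, encodeString("col")...)
	r := &reader{buf: buf}
	_ = r.skipString() // keyspace, already known
	_ = r.skipString() // table, already known
	name, err := r.readString()
	fmt.Println(name, err)
}
```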

… types

Add early-return optimization for simple native types (0x0001-0x0015) in readTypeInfo(). These types (int, text, bigint, timestamp, UUID, etc.) represent most columns in typical queries and need no further processing beyond creating the NativeType struct.

Changes:
- Add fast-path check: if id > 0 && id <= 0x0015 { return simple }
- Eliminates TypeCustom branch check for native types
- Skips entire switch statement for collection/UDT/tuple types
- Simplify readCol(): convert single-case switch to type assertion

Benefits:
- Reduces hot-path branching for most common type IDs
- More idiomatic Go code (type assertion vs single-case switch)
- No performance regression in benchmarks (464.6 ns/op vs 465.5 ns/op baseline)

The optimization primarily benefits metadata parsing during prepared statement execution and result frame processing, especially with DisableSkipMetadata=true (the default).

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
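
The range check can be sketched as below. The option IDs used for the non-native cases (0x0000 custom, 0x0020 to 0x0022 collections, 0x0030 UDT, 0x0031 tuple) are taken from the native protocol specification as I understand it, and `classify` is a hypothetical helper rather than the driver's readTypeInfo().

```go
package main

import "fmt"

// classify reports which parsing path a type option ID would take.
// Native types return immediately; only the remaining IDs reach the switch.
func classify(id uint16) string {
	if id > 0 && id <= 0x0015 {
		return "native" // no further type information follows on the wire
	}
	switch id {
	case 0x0000:
		return "custom" // followed by a class name string
	case 0x0020, 0x0021, 0x0022:
		return "collection" // followed by element type(s)
	case 0x0030:
		return "udt" // followed by keyspace, name, and fields
	case 0x0031:
		return "tuple" // followed by element types
	}
	return "unknown"
}

func main() {
	for _, id := range []uint16{0x0009, 0x0000, 0x0022, 0x0031} {
		fmt.Printf("0x%04X -> %s\n", id, classify(id))
	}
}
```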

…h prediction

Convert all framer read* methods from error-first to happy-path-first pattern to improve CPU branch prediction. Modern processors predict forward branches (common case) more efficiently than backward branches (error paths). Since there are no `err != nil` checks here, which the compiler knows how to handle well for branch prediction, this is a reasonable change.

Changes:
- Reorder all read* functions: check buffer has sufficient data first, process and return on success, panic on error path
- Affects: readByte, readInt, readShort, readString, skipString, readLongString, ReadBytesInternal, readBytes, readBytesCopy, readShortBytes, readInetAdressOnly
- No functional changes, only reordering of conditionals

Benchmark: BenchmarkRowData shows stable performance at 463.0 ns/op (no regression from 464.6 ns/op baseline). A difference is unlikely to be visible in most cases, but the change is also reasonable for readability.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
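
The reordering described above might look like the following for a 4-byte integer read. The `framer` here is a simplified, hypothetical type, and `tryReadInt` is an added helper that shows how a caller could turn the panic back into an error.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// framer is a simplified, hypothetical read cursor.
type framer struct{ buf []byte }

// readInt checks for sufficient data first, returns on the success path,
// and leaves the failure handling at the end of the function.
func (f *framer) readInt() int32 {
	if len(f.buf) >= 4 {
		n := int32(binary.BigEndian.Uint32(f.buf))
		f.buf = f.buf[4:]
		return n
	}
	panic(fmt.Errorf("not enough bytes to read int: need 4, have %d", len(f.buf)))
}

// tryReadInt recovers from the panic and reports it as an error.
func tryReadInt(f *framer) (n int32, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("%v", r)
		}
	}()
	return f.readInt(), nil
}

func main() {
	f := &framer{buf: []byte{0, 0, 1, 0}}
	fmt.Println(f.readInt())
	_, err := tryReadInt(f) // buffer is now empty
	fmt.Println(err)
}
```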

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Lorak-mmk · 2026-02-06T09:20:10Z

@mykaul
Regarding protocol violation: where exactly is your optimization used?
It is fine to do such optimization for result metadata (present in ROWS and PREPARED) responses.
It is however NOT fine to do this for prepared metadata (metadata about prepared statement bound variables).
Such metadata can actually contain different table specs for different vars, when a user prepares a textual batch statement that contains statements operating on different tables.

We were bitten by this in Rust Driver before: scylladb/scylla-rust-driver#1134
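
A small sketch of why reading the table spec once is unsafe for bind-variable metadata. The `colSpec` type and both functions are hypothetical; the two-variable batch stands in for a prepared textual batch whose statements target tables `ks.a` and `ks.b`.

```go
package main

import "fmt"

// colSpec is a hypothetical per-variable column spec.
type colSpec struct {
	keyspace, table, name string
}

// readOnce mimics the shortcut: take keyspace/table from the first
// column and apply them to every column.
func readOnce(cols []colSpec) []colSpec {
	out := make([]colSpec, len(cols))
	for i, c := range cols {
		out[i] = colSpec{cols[0].keyspace, cols[0].table, c.name}
	}
	return out
}

// sameTableSpec reports whether the shortcut would be lossless.
func sameTableSpec(cols []colSpec) bool {
	for i := 1; i < len(cols); i++ {
		if cols[i].keyspace != cols[0].keyspace || cols[i].table != cols[0].table {
			return false
		}
	}
	return true
}

func main() {
	// Bind variables of a batch with one INSERT into ks.a and one into ks.b.
	batch := []colSpec{{"ks", "a", "x"}, {"ks", "b", "y"}}
	got := readOnce(batch)
	fmt.Println("wire says:", batch[1].table, "shortcut says:", got[1].table)
	fmt.Println("shortcut safe:", sameTableSpec(batch))
}
```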

mykaul · 2026-02-09T08:34:39Z

> @mykaul Regarding protocol violation: where exactly is your optimization used? It is fine to do such optimization for result metadata (present in ROWS and PREPARED) responses. It is however NOT fine to do this for prepared metadata (metadata about prepared statement bound variables). Such metadata can actually contain different table specs for different vars, when a user prepares a textual batch statement that contains statements operating on different tables.
>
> We were bitten by this in Rust Driver before: scylladb/scylla-rust-driver#1134

OK - thanks! Not only I need to fix it, but I also believe we must add a test for this!

mykaul · 2026-02-10T20:02:59Z

> @mykaul Regarding protocol violation: where exactly is your optimization used? It is fine to do such optimization for result metadata (present in ROWS and PREPARED) responses. It is however NOT fine to do this for prepared metadata (metadata about prepared statement bound variables). Such metadata can actually contain different table specs for different vars, when a user prepares a textual batch statement that contains statements operating on different tables.
> We were bitten by this in Rust Driver before: scylladb/scylla-rust-driver#1134
>
> OK - thanks! Not only I need to fix it, but I also believe we must add a test for this!

Fixed - and added a test.

mykaul · 2026-02-13T09:33:18Z

Performance Progression: Baseline → Pre-sizing+Indexing → +Reflection Elimination

Benchmark	Baseline	After Commit 1	Current	Total Improvement
RowData	1122 ns/op 720 B, 22 allocs	1092 ns/op 720 B, 22 allocs	463.3 ns/op 400 B, 12 allocs	🚀 58.7% faster -44% memory, -45% allocs
RowDataSmall	380.7 ns/op 216 B, 8 allocs	370.7 ns/op 216 B, 8 allocs	180.6 ns/op 120 B, 5 allocs	🚀 52.6% faster -44% memory, -38% allocs
RowDataLarge	5220 ns/op 3792 B, 102 allocs	5074 ns/op 3792 B, 102 allocs	1865 ns/op 2192 B, 52 allocs	🚀 64.3% faster -42% memory, -49% allocs
RowDataWithTypes	1216 ns/op 776 B, 22 allocs	1174 ns/op 776 B, 22 allocs	533.9 ns/op 456 B, 12 allocs	🚀 56.1% faster -41% memory, -45% allocs
RowDataWithTuples	1556 ns/op 616 B, 20 allocs	1369 ns/op 488 B, 18 allocs	1017 ns/op 328 B, 13 allocs	🚀 34.6% faster -47% memory, -35% allocs
RowDataRepeated	112088 ns/op 72000 B, 2200 allocs	108273 ns/op 72000 B, 2200 allocs	46064 ns/op 40000 B, 1200 allocs	🚀 58.9% faster -44% memory, -45% allocs
10cols	1144 ns/op 720 B, 22 allocs	1109 ns/op 720 B, 22 allocs	470.8 ns/op 400 B, 12 allocs	🚀 58.9% faster -44% memory, -45% allocs
100cols	10580 ns/op 7584 B, 202 allocs	10130 ns/op 7584 B, 202 allocs	3624 ns/op 4384 B, 102 allocs	🚀 65.7% faster -42% memory, -50% allocs
1000cols	103552 ns/op 72768 B, 2002 allocs	101583 ns/op 72768 B, 2002 allocs	35607 ns/op 40768 B, 1002 allocs	🚀 65.6% faster -44% memory, -50% allocs
WithTuples	1602 ns/op 616 B, 20 allocs	1390 ns/op 488 B, 18 allocs	1047 ns/op 328 B, 13 allocs	🚀 34.6% faster -47% memory, -35% allocs

mykaul · 2026-02-13T09:34:43Z

(Note - need to repeat the tests, will do later)

mykaul · 2026-02-14T11:12:37Z

Running go test -run=^$ -bench=Wiki -benchmem -tags integration -count=1 -cpu=1 -cluster=127.0.2.1 . did not convince me there's a winner here. I wonder if my benchmark isn't good, or I may not be testing well (most likely - my Scylla nodes may be running wherever and it's all on my laptop anyway).

mykaul changed the title ~~(Improvement) receive path performance improvement~~ (Improvement) receive path performance improvements Feb 5, 2026

mykaul marked this pull request as draft February 5, 2026 08:51

mykaul force-pushed the rec_perf_improvements branch from f90fadf to 500329a Compare February 5, 2026 09:42

mykaul marked this pull request as ready for review February 5, 2026 10:59

mykaul added enhancement performance labels Feb 5, 2026

mykaul requested a review from Copilot February 5, 2026 19:04

Copilot started reviewing on behalf of mykaul February 5, 2026 19:05

Copilot AI reviewed Feb 5, 2026

mykaul added 10 commits February 5, 2026 23:53

(improvement) change default buffer from 128b(!) to 4K

d2be651

128 is really really small. Most results are likely larger. 4K is reasonable (and happens to be the same size as in the Python driver) Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

(fix) fix review comment - remove useless (unreachable) error check.

5ee3be2

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

mykaul force-pushed the rec_perf_improvements branch from 500329a to 5ee3be2 Compare February 6, 2026 09:11

mykaul marked this pull request as draft February 6, 2026 12:21

Fix prepared batch metadata parsing

7f23382

mykaul changed the title ~~(Improvement) receive path performance improvements~~ (Improvement) receive path performance improvements - NOT for MERGE Mar 16, 2026

This was referenced Mar 16, 2026

perf: streamline readHeader() function #778

Merged

perf: eliminate reflection overhead in RowData() and type instantiation #779

Merged

perf: optimize column metadata parsing and readTypeInfo() #780

Merged

mykaul wants to merge 11 commits into scylladb:master from mykaul:rec_perf_improvements


3 participants
