Commit e637ca4

perf: add views & cursor
1 parent 2957156 commit e637ca4

7 files changed

Lines changed: 1494 additions & 0 deletions

binary/BENCH.md

Lines changed: 104 additions & 0 deletions
@@ -0,0 +1,104 @@
# binary/ performance baseline

Comparison of the vendored `binary` package at the first branch commit
(before any perf work on top of it) against HEAD (all branch perf work,
including techniques #2, #4, #7, #8 added on top of the initial vendoring
and the in-branch perf refactors).

## Setup

- Initial: `56bb04765b227a498a22a9a7f47a4c35a11c7576` ("perf: vendor and improve binary pkg")
- HEAD: `d736ed98a0789c29ca6fc46ba5b010c86a351c80`
- Host: Apple M4 Max, darwin/arm64
- Runner: `go test -bench . -benchmem -benchtime=500ms -count=6 -run ^$ ./binary/`
- Stats: `benchstat` (6 runs per benchmark)

## Headline wins (shared benchmarks, present on both sides)

| Benchmark                                   |  Initial |     HEAD | Time delta |               B/op |        allocs/op |
| ------------------------------------------- | -------: | -------: | ---------: | -----------------: | ---------------: |
| Encode_Struct_Borsh                         | 248.8 ns | 199.2 ns |     -19.9% |  248 -> 112 (-55%) |    4 -> 1 (-75%) |
| Encode_Struct_Borsh_Buffered                | 240.3 ns | 193.1 ns |     -19.7% |   136 -> 0 (-100%) |   3 -> 0 (-100%) |
| ByteCount/flat                              | 226.1 ns | 156.4 ns |     -30.8% |  216 -> 120 (-44%) |    6 -> 2 (-67%) |
| ByteCount/nested/small_list                 |  1385 ns |   919 ns |     -33.7% |  720 -> 184 (-74%) |  41 -> 10 (-76%) |
| ByteCount/nested/large_list                 | 17.48 us | 12.85 us |     -26.5% |               -31% |             -67% |
| ByteCount/deep/small_list                   |  4.42 us |  2.79 us |     -36.7% | 2048 -> 312 (-85%) | 123 -> 26 (-79%) |
| ByteCount/deep/large_list                   | 52.92 us | 38.69 us |     -26.9% |               -31% |             -67% |
| CompactU16 (reader)                         |  1.26 ns |  1.23 ns |      -2.3% |                  - |                - |
| CompactU16Encode                            | 10.09 ns |  9.47 ns |      -6.2% |                  - |                - |
| _uintSlice32_Decode_field_withCustomDecoder |  2.52 us |  2.47 us |      -1.9% |                  - |                - |

## Small regressions (micro-bench primitives)

Sub-nanosecond absolute regressions on single-primitive writes. Root
cause is the `if e.fixedBuf && ...` branch added in `toWriter` for the
fixed-buffer mode (#2). The branch predicts perfectly when fixed mode
isn't in use, but it still requires one byte load; at the ~3 ns granularity
of a single WriteUintN call this shows up as +0.3-0.6 ns.

For hot loops that accumulate this cost, `Cursor` (#4) is the escape
valve: it skips the Encoder primitives entirely and is up to ~12x faster
for primitive-heavy workloads (`Cursor_8xU64LE` at 4.14 ns vs.
`Encoder_8xU64LE` at 48.69 ns in the HEAD-only table).
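
To make the cursor idea concrete, here is a minimal, self-contained sketch of the pattern: a destination slice plus an offset, where each write is one `PutUint64` and one increment. The names `miniCursor`, `putU64LE`, and `encodeEight` are hypothetical stand-ins for illustration only; the actual type added by this commit is `Cursor` in `binary/cursor.go`.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// miniCursor is a hypothetical, stripped-down stand-in for the Cursor
// type: a destination slice plus a write offset, nothing else.
type miniCursor struct {
	buf []byte
	pos int
}

// putU64LE writes v little-endian at the current offset and advances
// by 8. There is no error return; writing past the end panics.
func (c *miniCursor) putU64LE(v uint64) *miniCursor {
	binary.LittleEndian.PutUint64(c.buf[c.pos:], v)
	c.pos += 8
	return c
}

// encodeEight writes eight uint64 values into dst, which the caller
// must pre-size to at least 64 bytes, and returns the written prefix.
func encodeEight(dst []byte, vals [8]uint64) []byte {
	c := &miniCursor{buf: dst}
	for _, v := range vals {
		c.putU64LE(v)
	}
	return c.buf[:c.pos]
}

func main() {
	var dst [64]byte
	out := encodeEight(dst[:], [8]uint64{1, 2, 3, 4, 5, 6, 7, 8})
	fmt.Println(len(out), out[0], out[8]) // prints "64 1 2"
}
```

The trade-off is the one described above: no per-write branch or error path, in exchange for the caller guaranteeing the buffer is large enough.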

| Benchmark                   |  Initial |     HEAD |            Delta |
| --------------------------- | -------: | -------: | ---------------: |
| Encode_WriteUint16          |  3.09 ns |  3.57 ns | +15.4% (+0.5 ns) |
| Encode_WriteUint32          |  3.09 ns |  3.61 ns | +16.9% (+0.5 ns) |
| Encode_WriteUint64          |  3.06 ns |  3.67 ns | +19.8% (+0.6 ns) |
| Encode_WriteUint64_Buffered |  3.85 ns |  4.24 ns | +10.3% (+0.4 ns) |
| Encode_CompactU16_1byte     |  6.51 ns |  6.85 ns |  +5.2% (+0.3 ns) |
| Encode_CompactU16_2byte     |  6.45 ns |  7.02 ns |  +8.8% (+0.6 ns) |
| Decode_SliceUint64_8k       |  4.27 us |  4.48 us |            +4.8% |
| Decode_SliceUint32_8k       |  2.36 us |  2.44 us |            +3.2% |
| Decode_ReadString_Copy      |  29.7 ns |  31.2 ns |            +5.3% |
| Decode_ReadString_Borrow    | 19.96 ns | 21.11 ns |            +5.8% |

## HEAD-only (new capabilities)

Benchmarks for APIs introduced by techniques #2, #4, #7. No baseline
exists on the initial commit. Reported for reference and as the reason
the small primitive regressions are acceptable.

| Benchmark                   | ns/op | B/op | allocs/op | Technique                  |
| --------------------------- | ----: | ---: | --------: | -------------------------- |
| MarshalInto_Struct_Borsh    | 200.4 |    0 |         0 | #2 EncodeInto              |
| Marshal_Struct_Borsh        | 254.4 |  576 |         1 | (baseline for MarshalInto) |
| MarshalInto_Struct_Bin      | 123.7 |    0 |         0 | #2 EncodeInto              |
| Marshal_Struct_Bin          | 178.1 |  576 |         1 | (baseline)                 |
| TxHeader_Cursor             | 10.66 |    0 |         0 | #4 Cursor                  |
| TxHeader_Encoder            | 65.21 |  112 |         1 | (baseline: Encoder-into)   |
| TxHeader_Raw                | 13.46 |    0 |         0 | (hand-rolled lower bound)  |
| Cursor_8xU64LE              |  4.14 |    0 |         0 | #4 Cursor                  |
| Encoder_8xU64LE             | 48.69 |  112 |         1 | (baseline)                 |
| PatchBlockhash_ViewAs       |  0.23 |    0 |         0 | #7 ViewAs                  |
| PatchBlockhash_Copy         |  0.23 |    0 |         0 | (raw copy)                 |
| PatchBlockhash_DecodeEncode | 180.7 |  128 |         2 | (no-ViewAs baseline)       |

## Geomean

`geomean: 117.8 ns -> 80.3 ns` over all 36 shared benchmarks -- **-31.9% overall**.

## Techniques landed on this branch

| #  | Technique                                                              | Headline delta |
| -- | ---------------------------------------------------------------------- | -------------- |
| #2 | EncodeInto (pre-sized output buffer)                                   | 1 alloc -> 0 allocs; -16% to -28% ns/op |
| #8 | Bounded allocations (MaxSliceLen/MaxMapLen, element-size-aware checks) | Closes map DoS (2^32 -> error) and slice element-size amplification. Zero perf cost. |
| #4 | Cursor (zero-overhead write cursor)                                    | 6.8-11.7x faster than Encoder for hand-rolled encoders |
| #7 | ViewAs (in-place field mutation)                                       | 730x faster than decode-then-encode round-trip for patches |
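
Technique #8 is described here only at a high level. As a rough illustration of what an element-size-aware length check can look like, the sketch below rejects a declared length before any allocation happens. `checkSliceLen`, `errLenTooLarge`, and the parameter names are hypothetical and are not taken from this package's API.

```go
package main

import (
	"errors"
	"fmt"
)

// errLenTooLarge is a hypothetical sentinel error for this sketch.
var errLenTooLarge = errors.New("declared length exceeds limit or remaining input")

// checkSliceLen validates a length prefix read from untrusted input
// before a slice of that length is allocated.
//   - declared:  the length prefix as decoded
//   - elemSize:  minimum encoded size of one element, in bytes
//   - remaining: bytes left in the input
//   - maxLen:    hard cap on element count (a MaxSliceLen-style limit)
func checkSliceLen(declared uint64, elemSize, remaining, maxLen int) error {
	if remaining < 0 || maxLen < 0 || declared > uint64(maxLen) {
		return errLenTooLarge
	}
	// Each element needs at least elemSize bytes of input, so the input
	// can hold at most remaining/elemSize elements. Dividing instead of
	// multiplying avoids integer overflow on hostile lengths.
	if elemSize > 0 && declared > uint64(remaining/elemSize) {
		return errLenTooLarge
	}
	return nil
}

func main() {
	// A 2^32 length prefix with only 16 bytes of input left is rejected.
	fmt.Println(checkSliceLen(1<<32, 8, 16, 1<<20))
	// Two 8-byte elements fit exactly in 16 bytes.
	fmt.Println(checkSliceLen(2, 8, 16, 1<<20))
}
```

The point of the element-size term is that a small input cannot request an allocation larger than itself, which is the amplification the table entry refers to.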

## Reproducing

```sh
# On HEAD
go test -bench . -benchmem -benchtime=500ms -count=6 -run '^$' ./binary/ > /tmp/bench-head.out

# Checkout the initial branch commit in a worktree to capture the baseline
git worktree add --detach /tmp/solana-initial 56bb047
(cd /tmp/solana-initial && go test -bench . -benchmem -benchtime=500ms -count=6 -run '^$' ./binary/ > /tmp/bench-initial.out)
git worktree remove /tmp/solana-initial

# Compare
go install golang.org/x/perf/cmd/benchstat@latest
~/go/bin/benchstat /tmp/bench-initial.out /tmp/bench-head.out
```

binary/cursor.go

Lines changed: 275 additions & 0 deletions
@@ -0,0 +1,275 @@
// Copyright 2024 github.com/gagliardetto
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

package bin

import (
	"encoding/binary"
	"math"
)

// Cursor is a zero-overhead write cursor into a caller-provided byte
// slice. Every primitive write is a single memory poke followed by a
// position advance: no error return, no scratch buffer, no encoding
// dispatch. The caller pre-sizes the destination slice; writes past the
// end cause a standard Go bounds-check panic (no cushion).
//
// Cursor is the fastest encode path in this package. For safety-first
// encoding with error returns and grow-on-overflow, use Encoder with
// NewBinEncoderInto / NewBinEncoderBuf instead.
//
// Methods return the receiver so calls can chain. The primitive
// numeric methods (WriteU*, WriteI*, WriteF*) are simple enough to
// inline; chained fluent code compiles to the same machine code as
// imperative `c.WriteU8(1); c.WriteU8(2)` statements.
//
// Cursor is not safe for concurrent use.
type Cursor struct {
	buf []byte
	pos int
}

// NewCursor returns a Cursor positioned at offset 0 of dst. Writes will
// advance through dst's backing array; the slice itself is never
// reallocated.
func NewCursor(dst []byte) *Cursor {
	return &Cursor{buf: dst}
}

// NewCursorAt returns a Cursor starting at the specified offset into
// dst. Useful when back-patching a header after knowing the body size:
// allocate, Skip past the header region, write the body, then open a
// second cursor at offset 0 to fill in the header fields.
func NewCursorAt(dst []byte, offset int) *Cursor {
	return &Cursor{buf: dst, pos: offset}
}

// --- State ---

// Len returns the current write offset. This equals the number of bytes
// written so far when the cursor started at offset 0 and has not been
// repositioned.
func (c *Cursor) Len() int { return c.pos }

// Cap returns the length of the cursor's underlying buffer (len(buf),
// not cap(buf)). Primitive writes past Cap() panic.
func (c *Cursor) Cap() int { return len(c.buf) }

// Remaining returns the number of bytes available for writing before
// the next poke would panic.
func (c *Cursor) Remaining() int { return len(c.buf) - c.pos }

// Pos returns the current write offset.
func (c *Cursor) Pos() int { return c.pos }

// SetPos repositions the cursor at offset n. No bounds check is done:
// pass a value in [0, Cap()]. Useful for back-patching after recording
// a position.
func (c *Cursor) SetPos(n int) *Cursor {
	c.pos = n
	return c
}

// Reset repositions the cursor at offset 0. Buffer contents are
// unchanged; subsequent writes overwrite them.
func (c *Cursor) Reset() *Cursor {
	c.pos = 0
	return c
}

// ResetTo repositions the cursor at offset 0 and rebinds it to dst.
// Useful for reusing one Cursor across many messages without
// allocating.
func (c *Cursor) ResetTo(dst []byte) *Cursor {
	c.buf = dst
	c.pos = 0
	return c
}

// Written returns a subslice of the underlying buffer covering the
// bytes written so far (buf[:pos]). Aliases the cursor's backing array.
// Copy the result if you need to retain it across further writes or
// Reset.
func (c *Cursor) Written() []byte { return c.buf[:c.pos] }

// Buffer returns the cursor's full underlying buffer. Aliases the
// backing array.
func (c *Cursor) Buffer() []byte { return c.buf }

// --- Single-byte primitives ---

// WriteU8 writes a uint8 and advances one byte.
func (c *Cursor) WriteU8(v uint8) *Cursor {
	c.buf[c.pos] = v
	c.pos++
	return c
}

// WriteI8 writes an int8 (reinterpreted as uint8) and advances one
// byte.
func (c *Cursor) WriteI8(v int8) *Cursor { return c.WriteU8(uint8(v)) }

// WriteBool writes 0x01 for true, 0x00 for false.
func (c *Cursor) WriteBool(v bool) *Cursor {
	if v {
		return c.WriteU8(1)
	}
	return c.WriteU8(0)
}

// --- Fixed-width integers: little-endian ---

// WriteU16LE writes v as 2 little-endian bytes and advances.
func (c *Cursor) WriteU16LE(v uint16) *Cursor {
	binary.LittleEndian.PutUint16(c.buf[c.pos:], v)
	c.pos += 2
	return c
}

// WriteU32LE writes v as 4 little-endian bytes and advances.
func (c *Cursor) WriteU32LE(v uint32) *Cursor {
	binary.LittleEndian.PutUint32(c.buf[c.pos:], v)
	c.pos += 4
	return c
}

// WriteU64LE writes v as 8 little-endian bytes and advances.
func (c *Cursor) WriteU64LE(v uint64) *Cursor {
	binary.LittleEndian.PutUint64(c.buf[c.pos:], v)
	c.pos += 8
	return c
}

// The signed variants write the two's-complement bit pattern of v via
// the matching unsigned method.
func (c *Cursor) WriteI16LE(v int16) *Cursor { return c.WriteU16LE(uint16(v)) }
func (c *Cursor) WriteI32LE(v int32) *Cursor { return c.WriteU32LE(uint32(v)) }
func (c *Cursor) WriteI64LE(v int64) *Cursor { return c.WriteU64LE(uint64(v)) }

// --- Fixed-width integers: big-endian ---

// WriteU16BE writes v as 2 big-endian bytes and advances.
func (c *Cursor) WriteU16BE(v uint16) *Cursor {
	binary.BigEndian.PutUint16(c.buf[c.pos:], v)
	c.pos += 2
	return c
}

// WriteU32BE writes v as 4 big-endian bytes and advances.
func (c *Cursor) WriteU32BE(v uint32) *Cursor {
	binary.BigEndian.PutUint32(c.buf[c.pos:], v)
	c.pos += 4
	return c
}

// WriteU64BE writes v as 8 big-endian bytes and advances.
func (c *Cursor) WriteU64BE(v uint64) *Cursor {
	binary.BigEndian.PutUint64(c.buf[c.pos:], v)
	c.pos += 8
	return c
}

// The signed variants write the two's-complement bit pattern of v via
// the matching unsigned method.
func (c *Cursor) WriteI16BE(v int16) *Cursor { return c.WriteU16BE(uint16(v)) }
func (c *Cursor) WriteI32BE(v int32) *Cursor { return c.WriteU32BE(uint32(v)) }
func (c *Cursor) WriteI64BE(v int64) *Cursor { return c.WriteU64BE(uint64(v)) }

// --- Floats ---

// WriteF32LE writes the IEEE 754 bits of v as 4 little-endian bytes.
func (c *Cursor) WriteF32LE(v float32) *Cursor {
	binary.LittleEndian.PutUint32(c.buf[c.pos:], math.Float32bits(v))
	c.pos += 4
	return c
}

// WriteF64LE writes the IEEE 754 bits of v as 8 little-endian bytes.
func (c *Cursor) WriteF64LE(v float64) *Cursor {
	binary.LittleEndian.PutUint64(c.buf[c.pos:], math.Float64bits(v))
	c.pos += 8
	return c
}

// WriteF32BE writes the IEEE 754 bits of v as 4 big-endian bytes.
func (c *Cursor) WriteF32BE(v float32) *Cursor {
	binary.BigEndian.PutUint32(c.buf[c.pos:], math.Float32bits(v))
	c.pos += 4
	return c
}

// WriteF64BE writes the IEEE 754 bits of v as 8 big-endian bytes.
func (c *Cursor) WriteF64BE(v float64) *Cursor {
	binary.BigEndian.PutUint64(c.buf[c.pos:], math.Float64bits(v))
	c.pos += 8
	return c
}
202+
203+
// --- Byte sequences ---
204+
205+
// WriteBytes copies src into the cursor buffer and advances len(src)
206+
// bytes. If src does not fit in Remaining() this panics with the
207+
// standard "index out of range" message from the underlying slice op.
208+
func (c *Cursor) WriteBytes(src []byte) *Cursor {
209+
n := copy(c.buf[c.pos:c.pos+len(src)], src)
210+
c.pos += n
211+
return c
212+
}
213+
214+
// WriteZero writes n zero bytes and advances n positions.
215+
func (c *Cursor) WriteZero(n int) *Cursor {
216+
end := c.pos + n
217+
clear(c.buf[c.pos:end])
218+
c.pos = end
219+
return c
220+
}
221+
222+
// Skip advances n positions without writing. The skipped bytes keep
223+
// whatever values the underlying buffer already held — callers should
224+
// overwrite them later or zero them via WriteZero if they need the
225+
// payload clean.
226+
func (c *Cursor) Skip(n int) *Cursor {
227+
c.pos += n
228+
return c
229+
}
230+
231+
// --- Length-prefix helpers ---
232+
//
233+
// Three variants cover the encoding schemes this package supports:
234+
// uvarint (bin), u32 little-endian (borsh), and Solana's compact-u16.
235+
// They all panic rather than returning errors so they remain
236+
// chainable; use the Encoder for error-returning variants.
237+
238+
// WriteLenBin writes a uvarint-encoded length (1–10 bytes). This matches
239+
// Encoder.WriteLength in EncodingBin mode.
240+
func (c *Cursor) WriteLenBin(l int) *Cursor {
241+
n := binary.PutUvarint(c.buf[c.pos:], uint64(l))
242+
c.pos += n
243+
return c
244+
}
245+
246+
// WriteLenBorsh writes a u32 little-endian length (4 bytes). Matches
247+
// Encoder.WriteLength in EncodingBorsh mode.
248+
func (c *Cursor) WriteLenBorsh(l int) *Cursor { return c.WriteU32LE(uint32(l)) }
249+
250+
// WriteLenCompactU16 writes Solana's compact-u16 length encoding (1–3
251+
// bytes). Panics if l > 0xFFFF.
252+
func (c *Cursor) WriteLenCompactU16(l int) *Cursor {
253+
n, err := PutCompactU16Length(c.buf[c.pos:c.pos+3], l)
254+
if err != nil {
255+
panic(err)
256+
}
257+
c.pos += n
258+
return c
259+
}
260+
261+
// --- UVarint/Varint for standalone values (not just lengths) ---
262+
263+
// WriteUvarint writes v as a uvarint (1–10 bytes).
264+
func (c *Cursor) WriteUvarint(v uint64) *Cursor {
265+
n := binary.PutUvarint(c.buf[c.pos:], v)
266+
c.pos += n
267+
return c
268+
}
269+
270+
// WriteVarint writes v as a zigzag-varint (1–10 bytes).
271+
func (c *Cursor) WriteVarint(v int64) *Cursor {
272+
n := binary.PutVarint(c.buf[c.pos:], v)
273+
c.pos += n
274+
return c
275+
}
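
The `NewCursorAt` doc comment describes a back-patching workflow: reserve space for a header, write the body, then fill in the header once the body size is known. The sketch below walks through that flow using a small local `cursor` type that mirrors a few of the methods above, so it compiles on its own. The `cursor` type and the `frame` helper are hypothetical and exist only for this example.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// cursor is a local, hypothetical copy of the minimal parts of Cursor
// needed for this example.
type cursor struct {
	buf []byte
	pos int
}

func (c *cursor) skip(n int) *cursor { c.pos += n; return c }

func (c *cursor) writeU32LE(v uint32) *cursor {
	binary.LittleEndian.PutUint32(c.buf[c.pos:], v)
	c.pos += 4
	return c
}

func (c *cursor) writeBytes(src []byte) *cursor {
	c.pos += copy(c.buf[c.pos:c.pos+len(src)], src)
	return c
}

// frame writes [u32 LE body length][body] into dst. The header is
// skipped first and back-patched once the body length is known.
func frame(dst, body []byte) []byte {
	c := &cursor{buf: dst}
	c.skip(4) // reserve the 4-byte length header
	start := c.pos
	c.writeBytes(body)
	bodyLen := c.pos - start

	// Second cursor at offset 0 over the same buffer fills the header.
	hdr := &cursor{buf: dst}
	hdr.writeU32LE(uint32(bodyLen))
	return dst[:c.pos]
}

func main() {
	out := frame(make([]byte, 16), []byte("hello"))
	fmt.Printf("% x\n", out) // prints "05 00 00 00 68 65 6c 6c 6f"
}
```

With the real type, the same flow would presumably use `Skip`, `WriteBytes`, and either `SetPos` or a second `NewCursor` over the same slice.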
