Comparison of the vendored binary package at the first branch commit
(before any perf work on top of it) against HEAD (all branch perf work,
including techniques #2, #4, #7, #8 added on top of the initial vendoring
and the in-branch perf refactors).
- Initial:
56bb04765b227a498a22a9a7f47a4c35a11c7576("perf: vendor and improve binary pkg") - HEAD:
d736ed98a0789c29ca6fc46ba5b010c86a351c80 - Host: Apple M4 Max, darwin/arm64
- Runner:
go test -bench . -benchmem -benchtime=500ms -count=6 -run ^$ ./binary/ - Stats:
benchstat(6 runs per benchmark)
| Benchmark | Initial | HEAD | Time | B/op | allocs/op |
|---|---|---|---|---|---|
| Encode_Struct_Borsh | 248.8 ns | 199.2 ns | -19.9% | 248 -> 112 (-55%) | 4 -> 1 (-75%) |
| Encode_Struct_Borsh_Buffered | 240.3 ns | 193.1 ns | -19.7% | 136 -> 0 (-100%) | 3 -> 0 (-100%) |
| ByteCount/flat | 226.1 ns | 156.4 ns | -30.8% | 216 -> 120 (-44%) | 6 -> 2 (-67%) |
| ByteCount/nested/small_list | 1385 ns | 919 ns | -33.7% | 720 -> 184 (-74%) | 41 -> 10 (-76%) |
| ByteCount/nested/large_list | 17.48 us | 12.85 us | -26.5% | -31% | -67% |
| ByteCount/deep/small_list | 4.42 us | 2.79 us | -36.7% | 2048 -> 312 (-85%) | 123 -> 26 (-79%) |
| ByteCount/deep/large_list | 52.92 us | 38.69 us | -26.9% | -31% | -67% |
| CompactU16 (reader) | 1.26 ns | 1.23 ns | -2.3% | - | - |
| CompactU16Encode | 10.09 ns | 9.47 ns | -6.2% | - | - |
| _uintSlice32_Decode_field_withCustomDecoder | 2.52 us | 2.47 us | -1.9% | - | - |
Sub-nanosecond absolute regressions on single-primitive writes. Root
cause is the if e.fixedBuf && ... branch added in toWriter for the
fixed-buffer mode (#2). The branch predicts perfectly when fixed mode
isn't in use, but it still requires one byte load; at the 3 ns granularity
of a single WriteUintN call this shows up as +0.3-0.6 ns.
For hot loops that accumulate this cost, Cursor (#4) is the escape
valve: it skips the Encoder primitives entirely and is ~12x faster for
primitive-heavy workloads.
| Benchmark | Initial | HEAD | Delta |
|---|---|---|---|
| Encode_WriteUint16 | 3.09 ns | 3.57 ns | +15.4% (+0.5 ns) |
| Encode_WriteUint32 | 3.09 ns | 3.61 ns | +16.9% (+0.5 ns) |
| Encode_WriteUint64 | 3.06 ns | 3.67 ns | +19.8% (+0.6 ns) |
| Encode_WriteUint64_Buffered | 3.85 ns | 4.24 ns | +10.3% (+0.4 ns) |
| Encode_CompactU16_1byte | 6.51 ns | 6.85 ns | +5.2% (+0.3 ns) |
| Encode_CompactU16_2byte | 6.45 ns | 7.02 ns | +8.8% (+0.6 ns) |
| Decode_SliceUint64_8k | 4.27 us | 4.48 us | +4.8% |
| Decode_SliceUint32_8k | 2.36 us | 2.44 us | +3.2% |
| Decode_ReadString_Copy | 29.7 ns | 31.2 ns | +5.3% |
| Decode_ReadString_Borrow | 19.96 ns | 21.11 ns | +5.8% |
Benchmarks for APIs introduced by techniques #2, #4, #5, #7. No baseline exists on the initial commit. Reported for reference and as the reason the small primitive regressions are acceptable.
| Benchmark | ns/op | B/op | allocs/op | Technique |
|---|---|---|---|---|
| MarshalInto_Struct_Borsh | 200.4 | 0 | 0 | #2 EncodeInto |
| Marshal_Struct_Borsh | 254.4 | 576 | 1 | (baseline for MarshalInto) |
| MarshalInto_Struct_Bin | 123.7 | 0 | 0 | #2 EncodeInto |
| Marshal_Struct_Bin | 178.1 | 576 | 1 | (baseline) |
| TxHeader_Cursor | 10.66 | 0 | 0 | #4 Cursor |
| TxHeader_Encoder | 65.21 | 112 | 1 | (baseline: Encoder-into) |
| TxHeader_Raw | 13.46 | 0 | 0 | (hand-rolled lower bound) |
| Cursor_8xU64LE | 4.14 | 0 | 0 | #4 Cursor |
| Encoder_8xU64LE | 48.69 | 112 | 1 | (baseline) |
| MarshalPOD_Pubkey (32 B) | 0.25 | 0 | 0 | #5 MarshalPOD |
| MarshalBorshInto_Pubkey | 57.96 | 0 | 0 | (baseline for MarshalPOD) |
| MarshalPOD_BigStruct (8 x u64) | 0.25 | 0 | 0 | #5 MarshalPOD |
| MarshalBorshInto_BigStruct | 117.7 | 0 | 0 | (baseline) |
| UnmarshalPOD_BigStruct | 0.76 | 0 | 0 | #5 UnmarshalPOD |
| UnmarshalBorsh_BigStruct | 59.60 | 0 | 0 | (baseline for UnmarshalPOD) |
| PatchBlockhash_ViewAs | 0.23 | 0 | 0 | #7 ViewAs |
| PatchBlockhash_Copy | 0.23 | 0 | 0 | (raw copy) |
| PatchBlockhash_DecodeEncode | 180.7 | 128 | 2 | (no-ViewAs baseline) |
geomean: 117.8 ns -> 80.3 ns over all 36 shared benchmarks -- -31.9% overall.
| # | Technique | Headline delta |
|---|---|---|
| #2 | EncodeInto (pre-sized output buffer) | 1 alloc -> 0 allocs; -16% to -28% ns/op |
| #8 | Bounded allocations (MaxSliceLen/MaxMapLen, element-size-aware checks) | Closes map DoS (2^32 -> error) and slice element-size amplification. Zero perf cost. |
| #4 | Cursor (zero-overhead write cursor) | 6.8-11.7x faster than Encoder for hand-rolled encoders |
| #7 | ViewAs (in-place field mutation) | 730x faster than decode-then-encode round-trip for patches |
| #5 | MarshalPOD / UnmarshalPOD (generic memcpy) | 230-470x faster than reflection-driven Marshal for pure-POD types |
# On HEAD
go test -bench . -benchmem -benchtime=500ms -count=6 -run '^$' ./binary/ > /tmp/bench-head.out
# Checkout the initial branch commit in a worktree to capture the baseline
git worktree add --detach /tmp/solana-initial 56bb047
(cd /tmp/solana-initial && go test -bench . -benchmem -benchtime=500ms -count=6 -run '^$' ./binary/ > /tmp/bench-initial.out)
git worktree remove /tmp/solana-initial
# Compare
go install golang.org/x/perf/cmd/benchstat@latest
~/go/bin/benchstat /tmp/bench-initial.out /tmp/bench-head.out