solana-foundation · sonicfromnewyoke · Apr 23, 2026 · Apr 23, 2026 · Apr 11, 2026 · Apr 11, 2026
diff --git a/binary/BENCH.md b/binary/BENCH.md
@@ -0,0 +1,111 @@
+# binary/ performance baseline
+
+Compares the vendored `binary` package at the first branch commit (before
+any perf work on top of it) against HEAD, which includes all branch perf
+work: techniques #2, #4, #5, #7, and #8, added on top of the initial
+vendoring, plus the in-branch perf refactors.
+
+## Setup
+
+- Initial: `56bb04765b227a498a22a9a7f47a4c35a11c7576` ("perf: vendor and improve binary pkg")
+- HEAD:    `d736ed98a0789c29ca6fc46ba5b010c86a351c80`
+- Host:    Apple M4 Max, darwin/arm64
+- Runner:  `go test -bench . -benchmem -benchtime=500ms -count=6 -run '^$' ./binary/`
+- Stats:   `benchstat` (6 runs per benchmark)
+
+## Headline wins (shared benchmarks, present on both sides)
+
+| Benchmark                                   | Initial  | HEAD     | Time delta | B/op               | allocs/op        |
+| ------------------------------------------- | -------: | -------: | ---------: | -----------------: | ---------------: |
+| Encode_Struct_Borsh                         | 248.8 ns | 199.2 ns | -19.9%     | 248 -> 112 (-55%)  | 4 -> 1 (-75%)    |
+| Encode_Struct_Borsh_Buffered                | 240.3 ns | 193.1 ns | -19.7%     | 136 -> 0 (-100%)   | 3 -> 0 (-100%)   |
+| ByteCount/flat                              | 226.1 ns | 156.4 ns | -30.8%     | 216 -> 120 (-44%)  | 6 -> 2 (-67%)    |
+| ByteCount/nested/small_list                 | 1385 ns  | 919 ns   | -33.7%     | 720 -> 184 (-74%)  | 41 -> 10 (-76%)  |
+| ByteCount/nested/large_list                 | 17.48 us | 12.85 us | -26.5%     | -31%               | -67%             |
+| ByteCount/deep/small_list                   | 4.42 us  | 2.79 us  | -36.7%     | 2048 -> 312 (-85%) | 123 -> 26 (-79%) |
+| ByteCount/deep/large_list                   | 52.92 us | 38.69 us | -26.9%     | -31%               | -67%             |
+| CompactU16 (reader)                         | 1.26 ns  | 1.23 ns  | -2.3%      | -                  | -                |
+| CompactU16Encode                            | 10.09 ns | 9.47 ns  | -6.2%      | -                  | -                |
+| _uintSlice32_Decode_field_withCustomDecoder | 2.52 us  | 2.47 us  | -1.9%      | -                  | -                |
+
+## Small regressions (micro-bench primitives)
+
+The `Encode_*` rows are sub-nanosecond absolute regressions on
+single-primitive writes. Root cause is the `if e.fixedBuf && ...` branch
+added in `toWriter` for the fixed-buffer mode (#2). The branch predicts
+perfectly when fixed mode isn't in use, but it still costs one byte load; on
+a single ~3 ns WriteUintN call this shows up as +0.3-0.6 ns.
+
+For hot loops that accumulate this cost, `Cursor` (#4) is the escape
+valve: it bypasses the Encoder primitives entirely and is roughly 6-12x
+faster than Encoder in the HEAD-only benchmarks below.
+
+| Benchmark                   | Initial  | HEAD     | Delta            |
+| --------------------------- | -------: | -------: | ---------------: |
+| Encode_WriteUint16          | 3.09 ns  | 3.57 ns  | +15.4% (+0.5 ns) |
+| Encode_WriteUint32          | 3.09 ns  | 3.61 ns  | +16.9% (+0.5 ns) |
+| Encode_WriteUint64          | 3.06 ns  | 3.67 ns  | +19.8% (+0.6 ns) |
+| Encode_WriteUint64_Buffered | 3.85 ns  | 4.24 ns  | +10.3% (+0.4 ns) |
+| Encode_CompactU16_1byte     | 6.51 ns  | 6.85 ns  | +5.2% (+0.3 ns)  |
+| Encode_CompactU16_2byte     | 6.45 ns  | 7.02 ns  | +8.8% (+0.6 ns)  |
+| Decode_SliceUint64_8k       | 4.27 us  | 4.48 us  | +4.8%            |
+| Decode_SliceUint32_8k       | 2.36 us  | 2.44 us  | +3.2%            |
+| Decode_ReadString_Copy      | 29.7 ns  | 31.2 ns  | +5.3%            |
+| Decode_ReadString_Borrow    | 19.96 ns | 21.11 ns | +5.8%            |
+
+## HEAD-only (new capabilities)
+
+Benchmarks for APIs introduced by techniques #2, #4, #5, #7. No baseline
+exists on the initial commit. Reported for reference and as the reason
+the small primitive regressions are acceptable.
+
+| Benchmark                         | ns/op     | B/op  | allocs/op | Technique |
+| --------------------------------- | --------: | ----: | --------: | --------- |
+| MarshalInto_Struct_Borsh          | 200.4     | 0     | 0         | #2 EncodeInto |
+| Marshal_Struct_Borsh              | 254.4     | 576   | 1         | (baseline for MarshalInto) |
+| MarshalInto_Struct_Bin            | 123.7     | 0     | 0         | #2 EncodeInto |
+| Marshal_Struct_Bin                | 178.1     | 576   | 1         | (baseline) |
+| TxHeader_Cursor                   | 10.66     | 0     | 0         | #4 Cursor |
+| TxHeader_Encoder                  | 65.21     | 112   | 1         | (baseline: Encoder-into) |
+| TxHeader_Raw                      | 13.46     | 0     | 0         | (hand-rolled reference) |
+| Cursor_8xU64LE                    | 4.14      | 0     | 0         | #4 Cursor |
+| Encoder_8xU64LE                   | 48.69     | 112   | 1         | (baseline) |
+| MarshalPOD_Pubkey (32 B)          | 0.25      | 0     | 0         | #5 MarshalPOD |
+| MarshalBorshInto_Pubkey           | 57.96     | 0     | 0         | (baseline for MarshalPOD) |
+| MarshalPOD_BigStruct (8 x u64)    | 0.25      | 0     | 0         | #5 MarshalPOD |
+| MarshalBorshInto_BigStruct        | 117.7     | 0     | 0         | (baseline) |
+| UnmarshalPOD_BigStruct            | 0.76      | 0     | 0         | #5 UnmarshalPOD |
+| UnmarshalBorsh_BigStruct          | 59.60     | 0     | 0         | (baseline for UnmarshalPOD) |
+| PatchBlockhash_ViewAs             | 0.23      | 0     | 0         | #7 ViewAs |
+| PatchBlockhash_Copy               | 0.23      | 0     | 0         | (raw copy) |
+| PatchBlockhash_DecodeEncode       | 180.7     | 128   | 2         | (no-ViewAs baseline) |
+
+## Geomean
+
+`geomean: 117.8 ns -> 80.3 ns` over all 36 shared benchmarks -- **-31.9% overall**.
+
+## Techniques landed on this branch
+
+| #  | Technique                                  | Headline delta |
+| -- | ------------------------------------------ | -------------- |
+| #2 | EncodeInto (pre-sized output buffer)       | 1 alloc -> 0 allocs; -21% to -31% ns/op (MarshalInto vs Marshal above) |
+| #8 | Bounded allocations (MaxSliceLen/MaxMapLen, element-size-aware checks) | Closes map DoS (2^32 -> error) and slice element-size amplification. Zero perf cost. |
+| #4 | Cursor (zero-overhead write cursor)        | 6.1-11.8x faster than Encoder for hand-rolled encoders |
+| #7 | ViewAs (in-place field mutation)           | Roughly 790x faster than decode-then-encode round-trip for patches |
+| #5 | MarshalPOD / UnmarshalPOD (generic memcpy) | 230-470x faster than reflection-driven Marshal for pure-POD types |
+
+## Reproducing
+
+```sh
+# On HEAD
+go test -bench . -benchmem -benchtime=500ms -count=6 -run '^$' ./binary/ > /tmp/bench-head.out
+
+# Checkout the initial branch commit in a worktree to capture the baseline
+git worktree add --detach /tmp/solana-initial 56bb047
+(cd /tmp/solana-initial && go test -bench . -benchmem -benchtime=500ms -count=6 -run '^$' ./binary/ > /tmp/bench-initial.out)
+git worktree remove /tmp/solana-initial
+
+# Compare
+go install golang.org/x/perf/cmd/benchstat@latest
+"$(go env GOPATH)/bin/benchstat" /tmp/bench-initial.out /tmp/bench-head.out
+```