perf(fibre): decrease encoding memory usage by 8x #7091
Closed
New internal/slab package: a growable byte-region allocator with mmap-backed slabs, deficit-sized growth, grace-period shrink, and a sort + run-merge + monotonic-hint free path that handles shuffled per-validator releases in amortized O(N).

Replaces sync.Pool for fibre's multi-hundred-MiB Reed-Solomon work and parity buffers, which sync.Pool drops every GC cycle, inflating RSS from a theoretical ~1.5 GiB to 8+ GiB in practice.

Ships with unit tests, a fuzzer with per-op structural invariants, and benchmarks for contiguous/fragmented allocation, shuffled per-validator release, and reuse-after-partial. See fibre/internal/slab/DESIGN.md for properties, free-path optimizations, tradeoffs, and open future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduce RowAssembler to build K+N row layouts for Reed-Solomon encoding from a dedicated parity pool and a separate work pool, both backed by the internal slab allocator. Sharing one pool across the two classes caused steady-state multi-GiB/min slab churn under concurrent uploads; splitting them restores homogeneous slab shapes.

Assembly owns the pooled storage for a single blob, supports per-validator partial release via Free(rowIndices) and terminal Free(nil), and uses a RWMutex so Blob.Row → Freed reads run in parallel across validator goroutines. Partial Free nils entries in the assembled row view so Rows() never returns stale pool pointers.

Blob carries an *Assembly reference and routes its lifecycle through it. client_upload.go issues per-validator partial Free on intermediate goroutines and a terminal Free(nil) on the last goroutine, which also shrinks both pools at the blob lifecycle boundary. NewBlob consumes the blob on Upload — reuse fails with ErrBlobConsumed.

rsema1d.NewCoder now accepts variadic reedsolomon.Option so the assembler's work pool can be plugged in via WithWorkAllocator. Bumps klauspost/reedsolomon to the pseudo-version that exposes the WorkAllocator interface.

Ships with unit tests covering hybrid row layout, parity cleanliness across encoder reuse, partial vs terminal Free semantics, and a parallel-readers benchmark for the Assembly.Freed hot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Superseded by #7159
Important
Please read the summary carefully to understand the motivation behind this change.
Summary
Fibre's Reed-Solomon encoding is memory-hungry by nature. Every in-flight encode keeps roughly 8× the blob size resident: one copy of the original data, three copies for parity rows, and four copies of FFT scratch space. With 128 MiB blobs and a dozen concurrent uploads (the throughput that agentic payments target), that's more than ten gigabytes of large, short-lived buffers churning through the heap. On modestly sized validator nodes, this is genuinely hard to fit.
The first attempt (#7069) assumed Go's `sync.Pool` would absorb the churn. It doesn't. `sync.Pool` is tied to the garbage collector, and buffers this large trigger collections frequently enough that the pool never actually retains anything. Every encode ends up allocating fresh, and freed pages linger long enough that resident memory grows to several times the live working set. Meanwhile, the GC's heap-doubling behavior stacks successive allocations on top of each other, pushing memory-constrained nodes into OOMs before the pool has a chance to help. Essentially, the Go runtime is not suited to this kind of encoding workload, especially if the encoding fibre client is intended to be embedded in other processes.

However, the workload is a good fit for explicit memory management (and I would argue even requires it): allocations are large, predictable, and bound to a clear lifecycle event (the end of a blob's upload) rather than to GC timing. That's an unusual shape for a Go program, but it's the same shape embedded databases like pebble and badger deal with, and they reach for the same solution. This PR introduces a small slab allocator that owns fibre's encoding buffers directly, grows only when real demand appears, hands regions back to the OS when they stop being used, and is otherwise invisible to the GC.
The encoder retrieves these buffers via the new `WorkAllocator` interface added upstream in klauspost/reedsolomon#331, which gives full control over allocation of the work buffers during encoding.

The allocator, its tradeoffs, and the reasoning behind each design choice are documented in fibre/internal/slab/DESIGN.md. A design document felt necessary given the complexity this may bring to the codebase.

Commits
The total number of changes is rather large, so they are split into two commits, and those commits should be reviewed independently. Most of those changes are auxiliary, like tests, benchmarks, and fuzzing, so hopefully the actual code review won't be that big of an issue. If changes are perceived as large, I can spend time sharding them further.
- `perf(fibre): slab-backed pool allocator for RS work + parity buffers`: the allocator itself, with tests, benchmarks, a fuzzer, and the design doc.
- `feat(fibre): add RowAssembler for zero-allocation blob encoding`: integration through `RowAssembler`/`Assembly`, which owns pooled storage across an encode and releases it incrementally as validators finish.

Result
On a 16 GiB node, memory usage sits at the theoretical floor: around 1.4 GiB for a single in-flight encode, and roughly 13–15 GiB for ten concurrent ones, comfortably within the box. Steady-state slab churn drops to a few MiB per minute as both pools warm up to the concurrency high-water mark and stay there.
Most of the stress testing focused on the largest blob size because that's where memory pressure actually shows up. However, the allocator is size-agnostic by design, and slabs grow to whatever deficit a request leaves behind and pack smaller rows into the same space, so small and medium blobs should benefit from the same retention and reuse properties without any special-casing. That said, the mixed-size path has seen less direct measurement than the homogeneous large-blob case, and fragmentation behavior under varied workloads is noted as a known area to keep an eye on in the design doc.
Closes #7085