perf(fibre): decrease encoding memory usage by 8x#7091

Closed
Wondertan wants to merge 2 commits into main from feat/slab-pool-allocator

Conversation

@Wondertan (Member) commented Apr 16, 2026

Important

Please read the summary carefully to understand the motivation behind such a change

Summary

Fibre's Reed-Solomon encoding is memory-hungry by nature. Every in-flight encode keeps roughly 8× the blob size resident: one copy of the original data, three blob-sized parity rows, and four blob-sized regions of FFT scratch space. With 128 MiB blobs and a dozen concurrent uploads (the throughput that agentic payments target), that's more than ten gigabytes of large, short-lived buffers churning through the heap. On modestly sized validator nodes, this is genuinely hard to fit.
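To make the budget concrete, a quick back-of-the-envelope in Go (the constants simply restate the figures above):

```go
package main

import "fmt"

func main() {
	const (
		blobSize  = 128 << 20    // 128 MiB blob
		perEncode = 8 * blobSize // 1 data copy + 3 parity rows + 4 FFT scratch regions
		inFlight  = 12           // a dozen concurrent uploads
	)
	fmt.Printf("per encode: %d MiB\n", perEncode>>20)      // 1024 MiB
	fmt.Printf("peak: %d GiB\n", (inFlight*perEncode)>>30) // 12 GiB
}
```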

The first attempt (#7069) assumed Go's sync.Pool would absorb the churn. It doesn't. sync.Pool is tied to the garbage collector, and buffers this large trigger collections frequently enough that the pool never actually retains anything. Every encode ends up allocating fresh, and freed pages linger long enough that resident memory grows to several times the live working set. Meanwhile, the GC's heap-doubling behavior stacks successive allocations on top of each other, pushing memory-constrained nodes into OOMs before the pool has a chance to help. Essentially, the Go runtime's GC-driven memory management is not suited to this type of encoding workload, especially if the encoding fibre client is intended to be embedded in other processes.
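For context, a minimal sketch of the pattern the first attempt presumably relied on. The names are hypothetical, but the failure mode is the one described above: nothing is wrong with the code, yet the pool cannot hold onto buffers that are themselves what keeps the GC busy.

```go
package fibre

import "sync"

// Hypothetical shape of the #7069 approach: pool the encode buffers.
var workPool = sync.Pool{
	New: func() any {
		buf := make([]byte, 128<<20) // one blob-sized work buffer
		return &buf
	},
}

func encodeWithPool(blob []byte) {
	buf := workPool.Get().(*[]byte)
	// ... Reed-Solomon encode into *buf ...
	workPool.Put(buf)
	// The Put is not durable: the GC clears sync.Pool contents, and
	// allocations this large make collections frequent, so Get almost
	// always falls through to New and allocates fresh anyway.
}
```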

However, the workload is a good fit for explicit memory management (and I would argue even requires it): allocations are large, predictable, and bound to a clear lifecycle event (the end of a blob's upload) rather than to GC timing. That's an unusual shape for a Go program, but it's the same shape embedded databases like pebble and badger deal with, and they reach for the same solution. This PR introduces a small slab allocator that owns fibre's encoding buffers directly, grows only when real demand appears, hands regions back to the OS when they stop being used, and is otherwise invisible to the GC.
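As an illustration of the direction (not the actual fibre/internal/slab code), a slab that lives outside the Go heap can be as small as one anonymous mmap region; the real allocator layers deficit-sized growth, row packing, and grace-period shrink on top of something like this:

```go
package slab

import "golang.org/x/sys/unix"

// Slab is a minimal sketch of one mmap-backed region. Because the
// memory comes from mmap rather than the Go heap, the GC never scans
// it, and Release hands the pages straight back to the OS.
type Slab struct {
	buf []byte
	off int
}

func New(size int) (*Slab, error) {
	buf, err := unix.Mmap(-1, 0, size,
		unix.PROT_READ|unix.PROT_WRITE,
		unix.MAP_ANON|unix.MAP_PRIVATE)
	if err != nil {
		return nil, err
	}
	return &Slab{buf: buf}, nil
}

// Alloc bump-allocates n bytes; ok is false when the slab is full, in
// which case the real allocator grows a new slab sized to the deficit.
func (s *Slab) Alloc(n int) (b []byte, ok bool) {
	if s.off+n > len(s.buf) {
		return nil, false
	}
	b = s.buf[s.off : s.off+n : s.off+n]
	s.off += n
	return b, true
}

// Release returns the whole region to the OS.
func (s *Slab) Release() error { return unix.Munmap(s.buf) }
```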

The encoder retrieves these buffers via the new WorkAllocator interface added upstream in klauspost/reedsolomon#331, which gives fibre full control over work-buffer allocation during encoding.
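The authoritative definition is in klauspost/reedsolomon#331; purely as an assumed illustration (only the WorkAllocator name comes from this PR, the methods below are guesses), a hook of this kind looks roughly like:

```go
// Hypothetical shape; see klauspost/reedsolomon#331 for the real
// interface and the option that installs it.
type WorkAllocator interface {
	Alloc(n int) []byte // encoder requests an n-byte work buffer
	Free(b []byte)      // encoder hands the buffer back when done
}
```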

The allocator, its tradeoffs, and the reasoning behind each design choice are documented in fibre/internal/slab/DESIGN.md. A design document felt necessary given the complexity this change may bring to the codebase.

Commits

The total number of changes is rather large, so they are split into two commits that should be reviewed independently. Most of the changes are auxiliary (tests, benchmarks, and fuzzing), so the actual review surface is smaller than the diff suggests. If the changes still feel too large, I can split them further.

  1. perf(fibre): slab-backed pool allocator for RS work + parity buffers — the allocator itself, with tests, benchmarks, a fuzzer, and the design doc.
  2. feat(fibre): add RowAssembler for zero-allocation blob encoding — integration through RowAssembler / Assembly, which owns pooled storage across an encode and releases it incrementally as validators finish.

Result

On a 16 GiB node, memory usage sits at the theoretical floor: around 1.4 GiB for a single in-flight encode, and roughly 13–15 GiB for ten concurrent ones, comfortably within the box. Steady-state slab churn drops to a few MiB per minute as both pools warm up to the concurrency high-water mark and stay there.

Most of the stress testing focused on the largest blob size because that's where memory pressure actually shows up. The allocator is size-agnostic by design, though: slabs grow by whatever deficit a request leaves behind and pack smaller rows into the same space, so small and medium blobs should see the same retention and reuse benefits without any special-casing. That said, the mixed-size path has seen less direct measurement than the homogeneous large-blob case, and the design doc flags fragmentation behavior under varied workloads as a known area to watch.

Closes #7085

@Wondertan Wondertan force-pushed the feat/slab-pool-allocator branch from 9a8905a to 44b976e on April 20, 2026 at 00:01
@Wondertan Wondertan changed the title perf(fibre): slab-backed pool allocator for RS work + parity buffers perf(fibre): decrease encoding memory usage by 8x Apr 20, 2026
New internal/slab package: a growable byte-region allocator with
mmap-backed slabs, deficit-sized growth, grace-period shrink, and a
sort+run-merge+monotonic-hint free path that handles shuffled
per-validator releases in amortized O(N). Replaces sync.Pool for
fibre's multi-hundred-MiB Reed-Solomon work and parity buffers,
which sync.Pool drops every GC cycle and inflates RSS from ~1.5 GiB
theoretical to 8+ GiB in practice.

Ships with unit tests, a fuzzer with per-op structural invariants,
and benchmarks for contiguous/fragmented allocation, shuffled
per-validator release, and reuse-after-partial.

See fibre/internal/slab/DESIGN.md for properties, free-path
optimizations, tradeoffs, and open future work.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
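To ground the free-path description, a simplified sketch of the sort+run-merge step (the region type and batching here are my assumptions; the monotonic hint that skips the sort for already-ordered releases, which is what amortizes the path to O(N), is elided):

```go
package slab

import "sort"

type region struct{ off, length int }

// coalesce sorts a batch of freed regions by offset and merges
// adjacent runs, so N shuffled per-validator frees collapse into a
// handful of contiguous free regions. Sorting is O(N log N) here; the
// allocator's monotonic hint avoids it when releases already arrive
// in order.
func coalesce(freed []region) []region {
	sort.Slice(freed, func(i, j int) bool { return freed[i].off < freed[j].off })
	out := freed[:0]
	for _, r := range freed {
		if n := len(out); n > 0 && out[n-1].off+out[n-1].length == r.off {
			out[n-1].length += r.length // extend the current run
			continue
		}
		out = append(out, r)
	}
	return out
}
```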
@Wondertan Wondertan force-pushed the feat/slab-pool-allocator branch 2 times, most recently from 9926ce4 to a5535cc on April 20, 2026 at 00:37
@Wondertan Wondertan marked this pull request as ready for review April 20, 2026 12:58
@Wondertan Wondertan requested a review from a team as a code owner April 20, 2026 12:58
@Wondertan Wondertan requested review from evan-forbes and removed request for a team April 20, 2026 12:58
@claude (Bot) left a comment

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

@devin-ai-integration (Bot, Contributor) left a comment

Devin Review found 1 potential issue. View 6 additional findings in Devin Review.

Comment thread on fibre/client_upload.go
@Wondertan Wondertan force-pushed the feat/slab-pool-allocator branch 2 times, most recently from d85ac86 to fd9e8e6 on April 20, 2026 at 14:34
Introduce RowAssembler to build K+N row layouts for Reed-Solomon
encoding from a dedicated parity pool and a separate work pool, both
backed by the internal slab allocator. Sharing one pool across the
two classes caused steady-state multi-GiB/min slab churn under
concurrent uploads; splitting them restores homogeneous slab shapes.

Assembly owns the pooled storage for a single blob, supports
per-validator partial release via Free(rowIndices) and terminal
Free(nil), and uses an RWMutex so Blob.Row → Freed reads run in
parallel across validator goroutines. Partial Free nils entries in
the assembled row view so Rows() never returns stale pool pointers.

Blob carries an *Assembly reference and routes its lifecycle through
it. client_upload.go issues per-validator partial Free on intermediate
goroutines and a terminal Free(nil) on the last goroutine, which also
shrinks both pools at the blob lifecycle boundary. NewBlob consumes
the blob on Upload — reuse fails with ErrBlobConsumed.

rsema1d.NewCoder now accepts variadic reedsolomon.Option so the
assembler's work pool can be plugged in via WithWorkAllocator. Bumps
klauspost/reedsolomon to the pseudo-version that exposes the
WorkAllocator interface.

Ships with unit tests covering hybrid row layout, parity cleanliness
across encoder reuse, partial vs terminal Free semantics, and a
parallel-readers benchmark for the Assembly.Freed hot path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
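A hedged sketch of the lifecycle this commit describes; the method names are lifted from the message above, while the signatures, error handling, and the uploadBlob wrapper are assumptions:

```go
// Hypothetical call shapes; consult the commit for the real API.
func uploadBlob(data []byte, validators int) error {
	blob, err := fibre.NewBlob(data) // consumed on Upload; reuse fails with ErrBlobConsumed
	if err != nil {
		return err
	}
	asm := blob.Assembly()
	_ = asm.Rows() // K data rows + N parity rows backed by the pooled storage
	for v := 0; v < validators-1; v++ {
		asm.Free([]int{v}) // partial, per-validator release of finished rows
	}
	asm.Free(nil) // terminal free on the last goroutine: shrinks both pools
	return nil
}
```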
@Wondertan (Member, Author)

Superseded by #7159

@Wondertan Wondertan closed this Apr 22, 2026

Development

Successfully merging this pull request may close these issues.

rsema1d: eliminate work buffer pool churn in leopard encoder for variable shard sizes
