
feat(fibre): bound upload memory budget by bytes #7155

Draft
walldiss wants to merge 3 commits into chore/fibre-upload-isolation from chore/fibre-concurrent-blobs

Conversation


@walldiss walldiss commented Apr 21, 2026

Closes: https://linear.app/celestia/issue/PROTOCO-1557/fibre-blob-level-memory-admission-for-concurrent-uploads

Summary

Cap in-flight upload memory by bytes, not by a coarse blob count. ClientConfig.UploadMemoryBudget (default 512 MiB) is drawn down by blob.UploadSize() at each Upload() and released on exit. Concurrent uploads share the budget: small blobs pack tightly, a max-size blob takes its actual share, and an oversized blob fails fast rather than deadlocking.

Why bytes, not blobs

Memory is what we're bounding. Blob sizes vary from ~1 MiB to 128 MiB, so a "number of blobs" cap is a coarse proxy: sized for max-size blobs it under-admits small ones (counting them like big ones wastes concurrency), and sized for small blobs it over-admits large ones (blowing past the intended memory headroom). Byte accounting collapses this into one honest knob: "how much RAM am I willing to spend on in-flight upload buffers?"

Why this is separate from #7154

#7154 removed the old UploadConcurrency semaphore because its unit (one RPC) didn't match any real resource — wrong abstraction for failure isolation. The memory-admission concern it implicitly addressed is a different problem with a different right answer, and deserves its own PR and defaults discussion.

Change

  • New ClientConfig.UploadMemoryBudget (int64 bytes, default 512 MiB) with Validate() clamping.
  • Client.uploadBudget is a golang.org/x/sync/semaphore.Weighted (already an indirect dep; now direct).
  • Upload() reserves blob.UploadSize() at entry via Acquire(ctx, n) and Release(n) on exit. An oversized blob returns a clear error; ctx cancellation during wait returns ctx.Err().
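The admission path above can be sketched as follows. This is a stdlib-only stand-in (the PR itself uses golang.org/x/sync/semaphore.Weighted, whose Acquire is additionally ctx-cancellable); Client, Upload, and the field names here are illustrative, not the PR's actual code:

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// weighted is a minimal stand-in for semaphore.Weighted: a byte-counted
// semaphore that blocks Acquire until enough budget is free.
type weighted struct {
	mu   sync.Mutex
	cond *sync.Cond
	free int64
}

func newWeighted(n int64) *weighted {
	w := &weighted{free: n}
	w.cond = sync.NewCond(&w.mu)
	return w
}

func (w *weighted) Acquire(n int64) {
	w.mu.Lock()
	for w.free < n {
		w.cond.Wait() // wait until concurrent uploads release budget
	}
	w.free -= n
	w.mu.Unlock()
}

func (w *weighted) Release(n int64) {
	w.mu.Lock()
	w.free += n
	w.mu.Unlock()
	w.cond.Broadcast()
}

type Client struct {
	budget       int64 // ClientConfig.UploadMemoryBudget
	uploadBudget *weighted
}

func (c *Client) Upload(uploadSize int64) error {
	// Fail fast: a reservation larger than the whole budget can never
	// be satisfied, so surface a config error instead of deadlocking.
	if uploadSize > c.budget {
		return errors.New("blob upload size exceeds UploadMemoryBudget")
	}
	c.uploadBudget.Acquire(uploadSize)
	defer c.uploadBudget.Release(uploadSize)
	return nil // real code fans the blob out to peers here
}

func main() {
	c := &Client{budget: 512 << 20, uploadBudget: newWeighted(512 << 20)}
	fmt.Println(c.Upload(128<<20) == nil) // true: a max-size blob fits
	fmt.Println(c.Upload(600<<20) != nil) // true: oversized fails fast
}
```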

Sizing guidance

The default (512 MiB) accommodates four concurrent max-size (128 MiB) blobs on a validator-grade node. Tune against:

UploadMemoryBudget = max_memory_for_upload_buffers
                   ≳ 1 × max_blob_size     (so uploads aren't serialised)
                   ≲ safe_fraction_of_ram  (leave room for everything else)

An oversized blob (UploadSize > UploadMemoryBudget) returns an error at Upload() entry — a config that would otherwise deadlock is surfaced immediately.
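The sizing guidance can be expressed as a small clamp. This is a hypothetical helper, not the PR's Validate() implementation; maxBlobSize and the safe-RAM parameter are assumptions:

```go
package main

import "fmt"

const (
	maxBlobSize   = 128 << 20 // assumed maximum blob size (128 MiB)
	defaultBudget = 512 << 20 // PR default for UploadMemoryBudget
)

// clampBudget mirrors the guidance above: at least one max-size blob
// (so uploads aren't serialised behind an unsatisfiable reservation),
// at most the caller's safe fraction of RAM.
func clampBudget(requested, safeRAM int64) int64 {
	if requested <= 0 {
		requested = defaultBudget
	}
	if requested < maxBlobSize {
		requested = maxBlobSize
	}
	if requested > safeRAM {
		requested = safeRAM
	}
	return requested
}

func main() {
	fmt.Println(clampBudget(0, 4<<30))      // 536870912: falls back to default
	fmt.Println(clampBudget(64<<20, 4<<30)) // 134217728: raised to one max blob
	fmt.Println(clampBudget(8<<30, 4<<30))  // 4294967296: capped at safe RAM
}
```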

Depends on

#7154 — builds on the ClientConfig / Client surface introduced there. Base branch is chore/fibre-upload-isolation.

Test plan

  • Builds and full fibre test suite passes
  • Bench: burst of small blobs should pack and finish; burst of large blobs should serialise; oversized blob returns error immediately

🤖 Generated with Claude Code

@walldiss walldiss force-pushed the chore/fibre-concurrent-blobs branch from 584fb23 to bb66365 Compare April 21, 2026 14:39
@walldiss walldiss changed the title feat(fibre): bound concurrent blob uploads by memory budget feat(fibre): bound upload memory budget by bytes Apr 21, 2026
walldiss and others added 2 commits April 21, 2026 16:45
…layered timeouts

Replace the global RPC semaphore with a per-peer circuit breaker and
non-blocking fan-out. A dead validator holds only its own lane for one
DialTimeout; subsequent blob uploads skip it at zero cost via the
breaker. Throughput becomes insensitive to up to 1/3 peers down,
matching Fibre's BFT liveness bound.

- non-blocking fan-out: goroutines spawn up front; the circuit breaker
  check happens inside each goroutine, so a slow peer cannot delay
  other peers' goroutines from starting
- per-peer circuit breaker (CircuitFailureThreshold / CircuitCooldown):
  dead peer's cost is paid once (at first observation) and amortized
  across all subsequent blob uploads; closed-state Allow hits a
  lock-free atomic fast path
- layered timeouts: DialTimeout (3s) + RPCTimeout (15s) replace the
  single undifferentiated timeout so a black-holed peer is shed at
  dial time and healthy-but-slow peers get a generous RPC budget
- best-effort post-quorum delivery: Upload returns at 2/3 but
  background goroutines keep delivering to remaining peers so
  downloaders have more validators to read from
- circuit breaker state transitions emit a log line for operator
  visibility

BREAKING CHANGE: ClientConfig.UploadConcurrency is removed. The upload
path no longer exposes an RPC-count knob; concurrency is bounded by
the validator set size per blob plus the caller's own Upload-rate
discipline. Memory admission and peer-registry pruning are tracked
as follow-ups.

Closes: https://linear.app/celestia/issue/PROTOCO-1556/fibre-isolate-upload-failures-per-peer-to-preserve-23-quorum

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@walldiss walldiss force-pushed the chore/fibre-upload-isolation branch from 54a7c4e to c184fa4 Compare April 21, 2026 14:45
Introduce ClientConfig.UploadMemoryBudget (default 512 MiB) keyed on
bytes rather than blob count. Each Upload() reserves blob.UploadSize()
from a weighted semaphore at entry and releases on exit. Concurrent
uploads share the budget, so small blobs pack efficiently and large
blobs take their actual share instead of being accounted as a fixed
"slot."

Byte-level accounting matches the actual resource being bounded
(upload buffers) instead of using blob count as a coarse proxy. An
oversized blob — one whose UploadSize exceeds the total budget —
fails fast rather than deadlocking on a reservation that can never
be satisfied.

This is a companion to #7154, which removed the old UploadConcurrency
semaphore because it was the wrong abstraction for failure isolation.
This PR adds the right abstraction for memory admission.

- new ClientConfig.UploadMemoryBudget (int64 bytes, default 512 MiB)
- Client.uploadBudget is a golang.org/x/sync/semaphore.Weighted
- Upload() acquires blob.UploadSize() with ctx-cancellable Acquire;
  Releases on exit; rejects oversized blobs with a clear error

Closes: https://linear.app/celestia/issue/PROTOCO-1557/fibre-blob-level-memory-admission-for-concurrent-uploads

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@walldiss walldiss force-pushed the chore/fibre-concurrent-blobs branch from bb66365 to b00a8c1 Compare April 21, 2026 14:47
@walldiss walldiss force-pushed the chore/fibre-upload-isolation branch from c184fa4 to d833b63 Compare April 21, 2026 15:00
