feat(fibre): bound upload memory budget by bytes #7155
Draft
walldiss wants to merge 3 commits into chore/fibre-upload-isolation from
Force-pushed from 584fb23 to bb66365
…layered timeouts

Replace the global RPC semaphore with a per-peer circuit breaker and non-blocking fan-out. A dead validator holds only its own lane for one DialTimeout; subsequent blob uploads skip it at zero cost via the breaker. Throughput becomes insensitive to up to 1/3 of peers being down, matching Fibre's BFT liveness bound.

- non-blocking fan-out: goroutines spawn up front; the circuit breaker check happens inside each goroutine, so a slow peer cannot delay other peers' goroutines from starting
- per-peer circuit breaker (CircuitFailureThreshold / CircuitCooldown): a dead peer's cost is paid once (at first observation) and amortized across all subsequent blob uploads; closed-state Allow hits a lock-free atomic fast path
- layered timeouts: DialTimeout (3s) + RPCTimeout (15s) replace the single undifferentiated timeout, so a black-holed peer is shed at dial time and healthy-but-slow peers get a generous RPC budget
- best-effort post-quorum delivery: Upload returns at 2/3 but background goroutines keep delivering to remaining peers, so downloaders have more validators to read from
- circuit breaker state transitions emit a log line for operator visibility

BREAKING CHANGE: ClientConfig.UploadConcurrency is removed. The upload path no longer exposes an RPC-count knob; concurrency is bounded by the validator set size per blob plus the caller's own Upload-rate discipline. Memory admission and peer-registry pruning are tracked as follow-ups.

Closes: https://linear.app/celestia/issue/PROTOCO-1556/fibre-isolate-upload-failures-per-peer-to-preserve-23-quorum
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from 54a7c4e to c184fa4
Introduce ClientConfig.UploadMemoryBudget (default 512 MiB) keyed on bytes rather than blob count. Each Upload() reserves blob.UploadSize() from a weighted semaphore at entry and releases on exit. Concurrent uploads share the budget, so small blobs pack efficiently and large blobs take their actual share instead of being accounted as a fixed "slot."

Byte-level accounting matches the actual resource being bounded (upload buffers) instead of using blob count as a coarse proxy. An oversized blob, one whose UploadSize exceeds the total budget, fails fast rather than deadlocking on a reservation that can never be satisfied.

This is a companion to #7154, which removed the old UploadConcurrency semaphore because it was the wrong abstraction for failure isolation. This PR adds the right abstraction for memory admission.

- new ClientConfig.UploadMemoryBudget (int64 bytes, default 512 MiB)
- Client.uploadBudget is a golang.org/x/sync/semaphore.Weighted
- Upload() acquires blob.UploadSize() with ctx-cancellable Acquire; releases on exit; rejects oversized blobs with a clear error

Closes: https://linear.app/celestia/issue/PROTOCO-1557/fibre-blob-level-memory-admission-for-concurrent-uploads
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Force-pushed from bb66365 to b00a8c1
Force-pushed from c184fa4 to d833b63
Closes: https://linear.app/celestia/issue/PROTOCO-1557/fibre-blob-level-memory-admission-for-concurrent-uploads
Summary
Cap in-flight upload memory by bytes, not by a coarse blob count.
ClientConfig.UploadMemoryBudget (default 512 MiB) is drawn down by each Upload() proportional to blob.UploadSize() and released on exit. Concurrent uploads share the budget: small blobs pack tightly, a max-size blob takes its actual share, and an oversized blob fails fast rather than deadlocking.

Why bytes, not blobs
Memory is what we're bounding. Blob sizes vary ~1 MiB to 128 MiB — a "number of blobs" cap is a coarse proxy that either over-admits small blobs (wasting memory headroom by counting them like big ones) or under-admits them (wasting concurrency). Byte accounting collapses this into one honest knob: "how much RAM am I willing to spend on in-flight upload buffers?"
Why this is separate from #7154
#7154 removed the old UploadConcurrency semaphore because its unit (one RPC) didn't match any real resource: it was the wrong abstraction for failure isolation. The memory-admission concern it implicitly addressed is a different problem with a different right answer, and deserves its own PR and defaults discussion.

Change
- ClientConfig.UploadMemoryBudget (int64 bytes, default 512 MiB) with Validate() clamping.
- Client.uploadBudget is a golang.org/x/sync/semaphore.Weighted (already an indirect dep; now direct).
- Upload() reserves blob.UploadSize() at entry via Acquire(ctx, n) and Release(n) on exit. An oversized blob returns a clear error; ctx cancellation during the wait returns ctx.Err().

Sizing guidance
The default (512 MiB) accommodates roughly 4 concurrent max-size (128 MiB) blobs on a validator-grade node. Tune against:
An oversized blob (UploadSize > UploadMemoryBudget) returns an error at Upload() entry: a config that would otherwise deadlock is surfaced immediately.

Depends on
#7154: builds on the ClientConfig / Client surface introduced there. Base branch is chore/fibre-upload-isolation.

Test plan
🤖 Generated with Claude Code