fix: subscription webhook delivery stalling on HTTP errors #677
Open
sublimator wants to merge 5 commits into dev from
Conversation
fromNetwork() is async — it posts handlers to the io_service and returns immediately. The original sendThread() loop fires all queued events as concurrent HTTP connections at once. Under sustained load with a slow/failing endpoint, connections accumulate (each held up to 30s by RPC_NOTIFY timeout), exhausting file descriptors and breaking all network I/O for the entire process. Fix: use a local io_service per batch with .run() to block until the batch completes (same pattern as rpcClient() in RPCCall.cpp). This bounds concurrent connections to maxInFlight (32) per subscriber while still allowing parallel delivery. Also add a queue cap (maxQueueSize = 16384, ~80-160MB) so a hopelessly behind endpoint doesn't grow the deque indefinitely. Consumers detect gaps via the existing seq field. Ref: XRPLF/rippled#6341
When an HTTP response has no Content-Length header, HTTPClient reads until EOF. The EOF path in handleData logged "Complete." but never called invokeComplete(), leaving the socket held open for the full 30s deadline timeout and the completion callback never invoked. This is the likely root cause of webhook delivery permanently stalling after repeated 500 errors — many web frameworks omit Content-Length on error responses, triggering this path. Each leaked socket holds an FD for 30s, eventually exhausting the process FD budget. Includes HTTPClient_test with 12 test cases covering resource cleanup across success, error, timeout, connection-refused, concurrent request, and EOF-without-Content-Length scenarios.
- Advance mSeq when dropping events so consumers can detect gaps via sequence numbers, and log the dropped seq
- Use an ephemeral port (bind + close) instead of the hardcoded 19999 for the connection-refused test, to avoid false negatives on busy machines
Add test.net > ripple.basics dependency introduced by the new test file.
sendThread() now uses a local io_service per batch, so the app's io_service passed via make_RPCSub is dead code. Removes it from the header, constructor, factory function, and sole call site.
Note: XRPLF/rippled#6344 is the most up-to-date work and will be reviewed/merged first.
Summary

Fix subscription webhook (url) delivery permanently stalling when endpoints return HTTP errors (500, 404, etc.), affecting all subscribers.

Two bugs fixed:

1. sendThread() fired all queued events as concurrent async HTTP connections with no flow control, exhausting FDs
2. Responses without a Content-Length header never invoked the completion callback, holding sockets open for 30s

Upstream issue: XRPLF/rippled#6341
Bug 1: RPCSub Unbounded Concurrency

sendThread() passed the app's shared io_service to fromNetwork(), which posts async handlers and returns immediately — no .run(), no waiting:

📍 src/ripple/net/impl/RPCCall.cpp:1952-1975

The entire deque was drained at full speed, firing one async HTTP connection per event. Under sustained errors, each connection held an FD for 30s. At 100+ events/ledger every ~4s, this quickly exhausted the 1024 FD budget.
Fix

Use a local io_service per batch (same pattern as rpcClient()) with bounded concurrency (32 in-flight) and a queue cap (16384 events):
📍 src/ripple/net/impl/RPCSub.cpp:138-206
📍 src/ripple/net/impl/RPCSub.cpp:76-105

Bug 2: HTTPClient EOF Completion Leak
When an HTTP response has no Content-Length header, HTTPClient reads until EOF. The EOF path in handleData logged "Complete." but never called invokeComplete():

📍 src/ripple/net/impl/HTTPClient.cpp:435-470

Many web frameworks omit Content-Length on error responses, so this path is hit frequently by failing webhook endpoints. The deadline timer does shut down the socket, but handleShutdown never calls invokeComplete() either — and when the cancelled async_read re-enters handleData, mShutdown is already set to eof from the first call, so it falls into the same dead EOF branch again. The completion callback is never invoked. With a local io_service, .run() eventually returns when all handlers drain, but on the app's shared io_service (original code), the HTTPClientImp objects are kept alive by shared_from_this() captures indefinitely.

Tests
Added HTTPClient_test with 12 test cases covering resource cleanup across: success, HTTP 500, connection refused, timeout, server close, concurrent requests, and EOF-without-Content-Length (which confirmed the bug).
Test Plan
- HTTPClient_test passes (12/12 cases, 0 failures)
- x-testnet create-config --network mainnet --hooks-server
- x-testnet hooks-server --error 500:0.5 (50% HTTP 500 responses)