
fix: subscription webhook delivery stalling on HTTP errors#677

Open
sublimator wants to merge 5 commits into dev from subscription-hooks-fix

Conversation

@sublimator sublimator commented Feb 10, 2026

Note

XRPLF/rippled#6344 <- most up-to-date work, will review/merge first

Summary

Fix subscription webhook (url) delivery permanently stalling when endpoints
return HTTP errors (500, 404, etc.), affecting all subscribers.

Two bugs fixed:

  1. RPCSub unbounded concurrency — sendThread() fired all queued events
    as concurrent async HTTP connections with no flow control, exhausting FDs
  2. HTTPClient EOF completion leak — responses without Content-Length
    never invoked the completion callback, holding sockets open for 30s

Upstream issue: XRPLF/rippled#6341

Bug 1: RPCSub Unbounded Concurrency

sendThread() passed the app's shared io_service to fromNetwork(), which
posts async handlers and returns immediately — no .run(), no waiting:

📍 src/ripple/net/impl/RPCCall.cpp:1952-1975

    HTTPClient::request(
        bSSL,
        io_service,
        strIp,
        iPort,
        std::bind(
            &RPCCallImp::onRequest,
            strMethod,
            jvParams,
            headers,
            strPath,
            std::placeholders::_1,
            std::placeholders::_2,
            j),
        RPC_REPLY_MAX_BYTES,
        RPC_NOTIFY,
        std::bind(
            &RPCCallImp::onResponse,
            callbackFuncP,
            std::placeholders::_1,
            std::placeholders::_2,
            std::placeholders::_3,
            j),
        j);

The entire deque was drained at full speed, firing one async HTTP connection
per event. Under sustained errors, each connection held an FD for 30s. At
100+ events/ledger every ~4s, this quickly exhausted the 1024 FD budget.
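The arithmetic behind "quickly exhausted" can be sketched directly from the numbers stated above (a back-of-envelope calculation under the PR's stated assumptions, not measured data):

```cpp
#include <cassert>

// Assumptions from the PR: ~100 events per ledger, one ledger every
// ~4 s, and each failing connection holding its file descriptor for
// the full 30 s RPC_NOTIFY timeout.
constexpr double eventsPerLedger = 100.0;
constexpr double secondsPerLedger = 4.0;
constexpr double fdHoldSeconds = 30.0;

// Steady state: every FD opened in the last 30 s is still alive, so
// ~7.5 ledgers' worth of events hold descriptors at once.
constexpr double steadyStateFds =
    eventsPerLedger * (fdHoldSeconds / secondsPerLedger);  // 750
```

750 descriptors for a single failing subscriber is already most of a 1024 FD budget; with "100+" events, more than one subscriber, or the process's other sockets and files, the limit is crossed quickly.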

Fix

Use a local io_service per batch (same pattern as rpcClient()) with
bounded concurrency (32 in-flight) and a queue cap (16384 events):

📍 src/ripple/net/impl/RPCSub.cpp:138-206

void
sendThread()
{
    bool bSend;

    do
    {
        // Local io_service per batch — cheap to create (just an
        // internal event queue, no threads, no syscalls). Using a
        // local rather than the app's m_io_service is what makes
        // .run() block until exactly this batch completes, giving
        // us flow control. Same pattern used by rpcClient() in
        // RPCCall.cpp for CLI commands.
        boost::asio::io_service io_service;
        int dispatched = 0;

        {
            std::lock_guard sl(mLock);

            while (!mDeque.empty() && dispatched < maxInFlight)
            {
                auto const [seq, env] = mDeque.front();
                mDeque.pop_front();

                Json::Value jvEvent = env;
                jvEvent["seq"] = seq;

                RPCCall::fromNetwork(
                    io_service,
                    mIp,
                    mPort,
                    mUsername,
                    mPassword,
                    mPath,
                    "event",
                    jvEvent,
                    mSSL,
                    true,
                    logs_);
                ++dispatched;
            }

            if (dispatched == 0)
                mSending = false;
        }

        bSend = dispatched > 0;

        if (bSend)
        {
            try
            {
                JLOG(j_.info())
                    << "RPCCall::fromNetwork: " << mIp << " dispatching "
                    << dispatched << " events";
                io_service.run();
            }
            catch (const std::exception& e)
            {
                JLOG(j_.warn())
                    << "RPCCall::fromNetwork exception: " << e.what();
            }
            catch (...)
            {
                JLOG(j_.warn()) << "RPCCall::fromNetwork unknown exception";
            }
        }
    } while (bSend);
}

📍 src/ripple/net/impl/RPCSub.cpp:76-105

void
send(Json::Value const& jvObj, bool broadcast) override
{
    std::lock_guard sl(mLock);

    if (mDeque.size() >= maxQueueSize)
    {
        JLOG(j_.warn())
            << "RPCCall::fromNetwork drop: queue full (" << mDeque.size()
            << "), seq=" << mSeq << ", endpoint=" << mIp;
        ++mSeq;
        return;
    }

    auto jm = broadcast ? j_.debug() : j_.info();
    JLOG(jm) << "RPCCall::fromNetwork push: " << jvObj;

    mDeque.push_back(std::make_pair(mSeq++, jvObj));

    if (!mSending)
    {
        // Start a sending thread.
        JLOG(j_.info()) << "RPCCall::fromNetwork start";

        mSending = m_jobQueue.addJob(
            jtCLIENT_SUBSCRIBE, "RPCSub::sendThread", [this]() {
                sendThread();
            });
    }
}

Bug 2: HTTPClient EOF Completion Leak

When an HTTP response has no Content-Length header, HTTPClient reads until
EOF. The EOF path in handleData logged "Complete." but never called
invokeComplete(). The corrected handler below invokes the completion
callback on the EOF path as well:

📍 src/ripple/net/impl/HTTPClient.cpp:435-470

void
handleData(
    const boost::system::error_code& ecResult,
    std::size_t bytes_transferred)
{
    if (!mShutdown)
        mShutdown = ecResult;

    if (mShutdown && mShutdown != boost::asio::error::eof)
    {
        JLOG(j_.trace()) << "Read error: " << mShutdown.message();

        invokeComplete(mShutdown);
    }
    else
    {
        if (mShutdown)
        {
            // EOF: the body is complete. Previously this branch only
            // logged and returned; it now commits the buffered bytes
            // and invokes the completion callback like the normal
            // path below.
            JLOG(j_.trace()) << "Complete.";

            mResponse.commit(bytes_transferred);
            std::string strBody{
                {std::istreambuf_iterator<char>(&mResponse)},
                std::istreambuf_iterator<char>()};
            invokeComplete(ecResult, mStatus, mBody + strBody);
        }
        else
        {
            mResponse.commit(bytes_transferred);
            std::string strBody{
                {std::istreambuf_iterator<char>(&mResponse)},
                std::istreambuf_iterator<char>()};
            invokeComplete(ecResult, mStatus, mBody + strBody);
        }
    }
}

Many web frameworks omit Content-Length on error responses, so this path
is hit frequently by failing webhook endpoints. The deadline timer does shut
down the socket, but handleShutdown never calls invokeComplete() either —
and when the cancelled async_read re-enters handleData, mShutdown is
already set to eof from the first call, so it falls into the same dead
EOF branch again. The completion callback is never invoked. With a local
io_service, .run() eventually returns when all handlers drain, but on
the app's shared io_service (original code), the HTTPClientImp objects
are kept alive by shared_from_this() captures indefinitely.
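The corrected decision in handleData reduces to a small rule: only a non-EOF error is a failure; EOF without Content-Length is a successful body completion and must still invoke the callback. A standalone illustration of that rule (onReadDone, eofCode, and Action are made-up names, not code from the patch; std::error_code stands in for Boost's error codes):

```cpp
#include <cassert>
#include <system_error>

// What the read handler should do with the error code it receives.
enum class Action { ReportError, CompleteBody };

// ec is the result of the async read; eofCode stands in for
// boost::asio::error::eof.
Action onReadDone(std::error_code ec, std::error_code eofCode)
{
    if (ec && ec != eofCode)
        return Action::ReportError;  // genuine failure: invokeComplete(ec)

    // Either no error (ordinary read completion) or EOF marking the
    // end of a body with no Content-Length: both must commit the
    // buffered bytes and invoke the completion callback. The original
    // bug was treating the EOF case as neither success nor failure.
    return Action::CompleteBody;
}
```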

Tests

Added HTTPClient_test with 12 test cases covering resource cleanup across:
success, HTTP 500, connection refused, timeout, server close, concurrent
requests, and EOF-without-Content-Length (which confirmed the bug).

Test Plan

  • Build succeeds
  • HTTPClient_test passes (12/12 cases, 0 failures)
  • Connected to Xahau mainnet with x-testnet create-config --network mainnet --hooks-server
  • Ran x-testnet hooks-server --error 500:0.5 (50% HTTP 500 responses)
  • Confirmed events continue flowing through errors
  • Review: consider adding backoff/limit for repeatedly failing endpoints (follow-up)

fromNetwork() is async — it posts handlers to the io_service and
returns immediately. The original sendThread() loop fires all queued
events as concurrent HTTP connections at once. Under sustained load
with a slow/failing endpoint, connections accumulate (each held up to
30s by RPC_NOTIFY timeout), exhausting file descriptors and breaking
all network I/O for the entire process.

Fix: use a local io_service per batch with .run() to block until the
batch completes (same pattern as rpcClient() in RPCCall.cpp). This
bounds concurrent connections to maxInFlight (32) per subscriber while
still allowing parallel delivery.
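Stripped of Boost.Asio and rippled types, the flow-control pattern above can be sketched as: take at most maxInFlight items off the queue, block until that whole batch finishes, repeat. This standalone sketch substitutes std::async futures for the local io_service (Event, deliver, and drainBounded are illustrative names):

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <future>
#include <thread>
#include <vector>

struct Event { int seq; };

// Stand-in for one async HTTP delivery attempt.
static void deliver(Event const&)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// Drain the queue in bounded batches; returns the number of rounds.
int drainBounded(std::deque<Event>& queue, std::size_t maxInFlight)
{
    int rounds = 0;
    while (!queue.empty())
    {
        std::vector<std::future<void>> batch;
        while (!queue.empty() && batch.size() < maxInFlight)
        {
            Event ev = queue.front();
            queue.pop_front();
            batch.push_back(std::async(std::launch::async, deliver, ev));
        }
        for (auto& f : batch)
            f.get();  // analogous to io_service.run(): block until the batch drains
        ++rounds;
    }
    return rounds;
}
```

At most maxInFlight deliveries are ever in flight, exactly the property the local io_service provides in the real fix.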

Also add a queue cap (maxQueueSize = 16384, ~80-160MB) so a hopelessly
behind endpoint doesn't grow the deque indefinitely. Consumers detect
gaps via the existing seq field.
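Because a drop still advances mSeq, the consumer-side gap check is a one-line subtraction. A minimal sketch (missedEvents is an illustrative name, not part of the PR):

```cpp
#include <cassert>
#include <optional>

// Returns how many events were missed between the last seen seq and
// the newly received one, then records the new seq.
int missedEvents(std::optional<int>& lastSeq, int seq)
{
    int const missed = lastSeq ? seq - *lastSeq - 1 : 0;
    lastSeq = seq;
    return missed;
}
```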

Ref: XRPLF/rippled#6341
When an HTTP response has no Content-Length header, HTTPClient reads
until EOF. The EOF path in handleData logged "Complete." but never
called invokeComplete(), leaving the socket held open for the full
30s deadline timeout and the completion callback never invoked.

This is the likely root cause of webhook delivery permanently stalling
after repeated 500 errors — many web frameworks omit Content-Length on
error responses, triggering this path. Each leaked socket holds an FD
for 30s, eventually exhausting the process FD budget.

Includes HTTPClient_test with 12 test cases covering resource cleanup
across success, error, timeout, connection-refused, concurrent request,
and EOF-without-Content-Length scenarios.

- Advance mSeq when dropping events so consumers can detect gaps via
  sequence numbers, and log the dropped seq
- Use ephemeral port (bind + close) instead of hardcoded 19999 for the
  connection-refused test to avoid false negatives on busy machines
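The ephemeral-port trick works by letting the kernel assign a free port to a briefly bound socket; after closing it, that port is very unlikely to be in use moments later. A POSIX sketch of the idea (reserveEphemeralPort is an illustrative name, not the test's actual helper):

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Bind to port 0 on loopback, read back the kernel-chosen port, then
// close. Connecting to the returned port shortly afterwards should be
// refused, without hardcoding a number that might be busy.
int reserveEphemeralPort()
{
    int const fd = ::socket(AF_INET, SOCK_STREAM, 0);
    assert(fd >= 0);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // 0 means "kernel picks a free port"
    assert(::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0);

    socklen_t len = sizeof(addr);
    assert(::getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) == 0);
    int const port = ntohs(addr.sin_port);

    ::close(fd);  // released, but unlikely to be reassigned immediately
    return port;
}
```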

Add test.net > ripple.basics dependency introduced by the new test file.
@sublimator sublimator marked this pull request as ready for review February 10, 2026 01:32
sendThread() now uses a local io_service per batch, so the app's
io_service passed via make_RPCSub is dead code. Removes it from the
header, constructor, factory function, and sole call site.
