
fix: subscription webhook delivery stalling on HTTP errors#677

Open
sublimator wants to merge 5 commits into dev from subscription-hooks-fix

Conversation

@sublimator sublimator commented Feb 10, 2026

Note

XRPLF/rippled#6344 <- most up-to-date work, will review/merge first

Summary

Fix subscription webhook (url) delivery permanently stalling when endpoints
return HTTP errors (500, 404, etc.), affecting all subscribers.

Two bugs fixed:

  1. RPCSub unbounded concurrency — sendThread() fired all queued events
    as concurrent async HTTP connections with no flow control, exhausting FDs
  2. HTTPClient EOF completion leak — responses without Content-Length
    never invoked the completion callback, holding sockets open for 30s

Upstream issue: XRPLF/rippled#6341

Bug 1: RPCSub Unbounded Concurrency

sendThread() passed the app's shared io_service to fromNetwork(), which
posts async handlers and returns immediately — no .run(), no waiting:

📍 src/ripple/net/impl/RPCCall.cpp:1952-1975

    HTTPClient::request(
        bSSL,
        io_service,
        strIp,
        iPort,
        std::bind(
            &RPCCallImp::onRequest,
            strMethod,
            jvParams,
            headers,
            strPath,
            std::placeholders::_1,
            std::placeholders::_2,
            j),
        RPC_REPLY_MAX_BYTES,
        RPC_NOTIFY,
        std::bind(
            &RPCCallImp::onResponse,
            callbackFuncP,
            std::placeholders::_1,
            std::placeholders::_2,
            std::placeholders::_3,
            j),
        j);

The entire deque was drained at full speed, firing one async HTTP connection
per event. Under sustained errors, each connection held an FD for 30s. At
100+ events/ledger every ~4s, this quickly exhausted the 1024 FD budget.
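The arithmetic behind "quickly exhausted" can be sketched directly from the numbers stated above (a back-of-envelope calculation under the PR's stated assumptions, not measured data):

```cpp
#include <cassert>

// Assumptions from the PR: ~100 events per ledger, one ledger every
// ~4 s, and each failing connection holding its file descriptor for
// the full 30 s RPC_NOTIFY timeout.
constexpr double eventsPerLedger = 100.0;
constexpr double secondsPerLedger = 4.0;
constexpr double fdHoldSeconds = 30.0;

// Steady state: every FD opened in the last 30 s is still alive, so
// ~7.5 ledgers' worth of events hold descriptors at once.
constexpr double steadyStateFds =
    eventsPerLedger * (fdHoldSeconds / secondsPerLedger);  // 750
```

750 descriptors for a single failing subscriber is already most of a 1024 FD budget; with "100+" events, more than one subscriber, or the process's other sockets and files, the limit is crossed quickly.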

Fix

Use a local io_service per batch (same pattern as rpcClient()) with
bounded concurrency (32 in-flight) and a queue cap (16384 events):

📍 src/ripple/net/impl/RPCSub.cpp:138-206

void
sendThread()
{
    bool bSend;

    do
    {
        // Local io_service per batch — cheap to create (just an
        // internal event queue, no threads, no syscalls). Using a
        // local rather than the app's m_io_service is what makes
        // .run() block until exactly this batch completes, giving
        // us flow control. Same pattern used by rpcClient() in
        // RPCCall.cpp for CLI commands.
        boost::asio::io_service io_service;
        int dispatched = 0;

        {
            std::lock_guard sl(mLock);

            while (!mDeque.empty() && dispatched < maxInFlight)
            {
                auto const [seq, env] = mDeque.front();
                mDeque.pop_front();

                Json::Value jvEvent = env;
                jvEvent["seq"] = seq;

                RPCCall::fromNetwork(
                    io_service,
                    mIp,
                    mPort,
                    mUsername,
                    mPassword,
                    mPath,
                    "event",
                    jvEvent,
                    mSSL,
                    true,
                    logs_);
                ++dispatched;
            }

            if (dispatched == 0)
                mSending = false;
        }

        bSend = dispatched > 0;

        if (bSend)
        {
            try
            {
                JLOG(j_.info())
                    << "RPCCall::fromNetwork: " << mIp << " dispatching "
                    << dispatched << " events";
                io_service.run();
            }
            catch (const std::exception& e)
            {
                JLOG(j_.warn())
                    << "RPCCall::fromNetwork exception: " << e.what();
            }
            catch (...)
            {
                JLOG(j_.warn()) << "RPCCall::fromNetwork unknown exception";
            }
        }
    } while (bSend);
}

📍 src/ripple/net/impl/RPCSub.cpp:76-105

void
send(Json::Value const& jvObj, bool broadcast) override
{
    std::lock_guard sl(mLock);

    if (mDeque.size() >= maxQueueSize)
    {
        JLOG(j_.warn())
            << "RPCCall::fromNetwork drop: queue full (" << mDeque.size()
            << "), seq=" << mSeq << ", endpoint=" << mIp;
        ++mSeq;
        return;
    }

    auto jm = broadcast ? j_.debug() : j_.info();
    JLOG(jm) << "RPCCall::fromNetwork push: " << jvObj;

    mDeque.push_back(std::make_pair(mSeq++, jvObj));

    if (!mSending)
    {
        // Start a sending thread.
        JLOG(j_.info()) << "RPCCall::fromNetwork start";

        mSending = m_jobQueue.addJob(
            jtCLIENT_SUBSCRIBE, "RPCSub::sendThread", [this]() {
                sendThread();
            });
    }
}

Bug 2: HTTPClient EOF Completion Leak

When an HTTP response has no Content-Length header, HTTPClient reads until
EOF. The EOF path in handleData logged "Complete." but never called
invokeComplete(). The corrected handler below invokes the completion
callback on the EOF path as well:

📍 src/ripple/net/impl/HTTPClient.cpp:435-470

void
handleData(
    const boost::system::error_code& ecResult,
    std::size_t bytes_transferred)
{
    if (!mShutdown)
        mShutdown = ecResult;

    if (mShutdown && mShutdown != boost::asio::error::eof)
    {
        JLOG(j_.trace()) << "Read error: " << mShutdown.message();

        invokeComplete(mShutdown);
    }
    else
    {
        if (mShutdown)
        {
            // EOF: the body is complete. Previously this branch only
            // logged and returned; it now commits the buffered bytes
            // and invokes the completion callback like the normal
            // path below.
            JLOG(j_.trace()) << "Complete.";

            mResponse.commit(bytes_transferred);
            std::string strBody{
                {std::istreambuf_iterator<char>(&mResponse)},
                std::istreambuf_iterator<char>()};
            invokeComplete(ecResult, mStatus, mBody + strBody);
        }
        else
        {
            mResponse.commit(bytes_transferred);
            std::string strBody{
                {std::istreambuf_iterator<char>(&mResponse)},
                std::istreambuf_iterator<char>()};
            invokeComplete(ecResult, mStatus, mBody + strBody);
        }
    }
}

Many web frameworks omit Content-Length on error responses, so this path
is hit frequently by failing webhook endpoints. The deadline timer does shut
down the socket, but handleShutdown never calls invokeComplete() either —
and when the cancelled async_read re-enters handleData, mShutdown is
already set to eof from the first call, so it falls into the same dead
EOF branch again. The completion callback is never invoked. With a local
io_service, .run() eventually returns when all handlers drain, but on
the app's shared io_service (original code), the HTTPClientImp objects
are kept alive by shared_from_this() captures indefinitely.
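The corrected decision in handleData reduces to a small rule: only a non-EOF error is a failure; EOF without Content-Length is a successful body completion and must still invoke the callback. A standalone illustration of that rule (onReadDone, eofCode, and Action are made-up names, not code from the patch; std::error_code stands in for Boost's error codes):

```cpp
#include <cassert>
#include <system_error>

// What the read handler should do with the error code it receives.
enum class Action { ReportError, CompleteBody };

// ec is the result of the async read; eofCode stands in for
// boost::asio::error::eof.
Action onReadDone(std::error_code ec, std::error_code eofCode)
{
    if (ec && ec != eofCode)
        return Action::ReportError;  // genuine failure: invokeComplete(ec)

    // Either no error (ordinary read completion) or EOF marking the
    // end of a body with no Content-Length: both must commit the
    // buffered bytes and invoke the completion callback. The original
    // bug was treating the EOF case as neither success nor failure.
    return Action::CompleteBody;
}
```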

Tests

Added HTTPClient_test with 12 test cases covering resource cleanup across:
success, HTTP 500, connection refused, timeout, server close, concurrent
requests, and EOF-without-Content-Length (which confirmed the bug).

Test Plan

  • Build succeeds
  • HTTPClient_test passes (12/12 cases, 0 failures)
  • Connected to Xahau mainnet with x-testnet create-config --network mainnet --hooks-server
  • Ran x-testnet hooks-server --error 500:0.5 (50% HTTP 500 responses)
  • Confirmed events continue flowing through errors
  • Review: consider adding backoff/limit for repeatedly failing endpoints (follow-up)

fromNetwork() is async — it posts handlers to the io_service and
returns immediately. The original sendThread() loop fires all queued
events as concurrent HTTP connections at once. Under sustained load
with a slow/failing endpoint, connections accumulate (each held up to
30s by RPC_NOTIFY timeout), exhausting file descriptors and breaking
all network I/O for the entire process.

Fix: use a local io_service per batch with .run() to block until the
batch completes (same pattern as rpcClient() in RPCCall.cpp). This
bounds concurrent connections to maxInFlight (32) per subscriber while
still allowing parallel delivery.
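Stripped of Boost.Asio and rippled types, the flow-control pattern above can be sketched as: take at most maxInFlight items off the queue, block until that whole batch finishes, repeat. This standalone sketch substitutes std::async futures for the local io_service (Event, deliver, and drainBounded are illustrative names):

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <future>
#include <thread>
#include <vector>

struct Event { int seq; };

// Stand-in for one async HTTP delivery attempt.
static void deliver(Event const&)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
}

// Drain the queue in bounded batches; returns the number of rounds.
int drainBounded(std::deque<Event>& queue, std::size_t maxInFlight)
{
    int rounds = 0;
    while (!queue.empty())
    {
        std::vector<std::future<void>> batch;
        while (!queue.empty() && batch.size() < maxInFlight)
        {
            Event ev = queue.front();
            queue.pop_front();
            batch.push_back(std::async(std::launch::async, deliver, ev));
        }
        for (auto& f : batch)
            f.get();  // analogous to io_service.run(): block until the batch drains
        ++rounds;
    }
    return rounds;
}
```

At most maxInFlight deliveries are ever in flight, exactly the property the local io_service provides in the real fix.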

Also add a queue cap (maxQueueSize = 16384, ~80-160MB) so a hopelessly
behind endpoint doesn't grow the deque indefinitely. Consumers detect
gaps via the existing seq field.
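Because a drop still advances mSeq, the consumer-side gap check is a one-line subtraction. A minimal sketch (missedEvents is an illustrative name, not part of the PR):

```cpp
#include <cassert>
#include <optional>

// Returns how many events were missed between the last seen seq and
// the newly received one, then records the new seq.
int missedEvents(std::optional<int>& lastSeq, int seq)
{
    int const missed = lastSeq ? seq - *lastSeq - 1 : 0;
    lastSeq = seq;
    return missed;
}
```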

Ref: XRPLF/rippled#6341
When an HTTP response has no Content-Length header, HTTPClient reads
until EOF. The EOF path in handleData logged "Complete." but never
called invokeComplete(), leaving the socket held open for the full
30s deadline timeout and the completion callback never invoked.

This is the likely root cause of webhook delivery permanently stalling
after repeated 500 errors — many web frameworks omit Content-Length on
error responses, triggering this path. Each leaked socket holds an FD
for 30s, eventually exhausting the process FD budget.

Includes HTTPClient_test with 12 test cases covering resource cleanup
across success, error, timeout, connection-refused, concurrent request,
and EOF-without-Content-Length scenarios.

- Advance mSeq when dropping events so consumers can detect gaps via
  sequence numbers, and log the dropped seq
- Use ephemeral port (bind + close) instead of hardcoded 19999 for the
  connection-refused test to avoid false negatives on busy machines
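The ephemeral-port trick works by letting the kernel assign a free port to a briefly bound socket; after closing it, that port is very unlikely to be in use moments later. A POSIX sketch of the idea (reserveEphemeralPort is an illustrative name, not the test's actual helper):

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Bind to port 0 on loopback, read back the kernel-chosen port, then
// close. Connecting to the returned port shortly afterwards should be
// refused, without hardcoding a number that might be busy.
int reserveEphemeralPort()
{
    int const fd = ::socket(AF_INET, SOCK_STREAM, 0);
    assert(fd >= 0);

    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;  // 0 means "kernel picks a free port"
    assert(::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0);

    socklen_t len = sizeof(addr);
    assert(::getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) == 0);
    int const port = ntohs(addr.sin_port);

    ::close(fd);  // released, but unlikely to be reassigned immediately
    return port;
}
```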

Add test.net > ripple.basics dependency introduced by the new test file.
@sublimator sublimator marked this pull request as ready for review February 10, 2026 01:32
sendThread() now uses a local io_service per batch, so the app's
io_service passed via make_RPCSub is dead code. Removes it from the
header, constructor, factory function, and sole call site.
