CCBC-1702: requeue ops when vbmap briefly has no master

avsej · avsej · commit faf66a339124 · 2026-05-11T20:03:10.000Z
The historical default LCB_RETRY_ON_MISSINGNODE = 0 made retryq fail ops
with LCB_ERR_NO_MATCHING_SERVER whenever lcbvb_vbmaster() returned -1 or
returned an index >= cq->npipelines for the requested vbucket. That
happens whenever the cluster's view of the vbmap is in transition --
e.g., immediately after a failover, before a replica is promoted, or
during the brief window when the cmdq is in the middle of swapping in a
new pipeline array via replace_config().

This is not a CBS 6.0-specific bug. The race exists on every server
version. CBS 6.0 just exposes it more reliably because no
PROTOCOL_BINARY_FEATURE_DUPLEX means no server-driven config push, no
idle inbound read watcher, and config arrives later via bgpoll (default
2.5 s) -- giving retryq more chances to fire inside the replace window.
On newer servers the duplex push path drives a single atomic config
replace from the read callback, hitting the same race far less often.

Empirical confirmation from SDKD situational tests vs CBS 6.0.5 + TLS:
zero LCB_ERR_TIMEOUT errors observed across the baseline (build-1515) or
the CCBC-1701 keepalive build (build-1522), confirming the
timeout-detection hypothesis was off-target. Errors are dominated by
LCB_ERR_NO_MATCHING_SERVER from this code path and downstream
LCB_ERR_DOCUMENT_NOT_FOUND from retried ops landing on a
partially-converged map. With the fixes here, build-1523 reports
NO_MATCHING_SERVER count 363 -> 0 across the entire suite.

Fixes applied
-------------

1) Flip LCB_RETRY_ON_MISSINGNODE default from 0 to LCB_RETRY_CMDS_ALL.
   Packets whose vbucket has no mapped master are requeued back into the
   retry queue rather than failed immediately; the op deadline (default
   2.5 s) still bounds how long we keep retrying. If the map never
   recovers within the deadline, the op fails with the original error
   (preserved as origerr) -- same user-visible code, just not from a
   scheduling coincidence. Opt-out is unchanged via cntl /
   connection-string "retry=missingnode=0".

2) Hold a ConfigInfo ref across Server::handle_nmv. The pre-fix code
   read instance->cmdq.config (a raw lcbvb_CONFIG*) without a refcount,
   so cccp_update() inside the handler -- or any other config-replace
   path that fires in a nested event-loop frame -- could decref the old
   ConfigInfo and free its lcbvb_CONFIG while lcbvb_nmv_remap_ex() was
   mid-deref. Build-1523 hit this as a SIGSEGV in lcbvb_nmv_remap_ex
   during FoRecoverDelta scenarios when the (1) requeue change kept ops
   in flight long enough to reach handle_nmv across a config
   replacement. The latent UAF was reachable on master too, just rarer;
   the requeue fix surfaced it. Hold the ref at function entry, decref
   on every return path.

3) Defensive zeroing in lcbvb_destroy(). After freeing the contained
   pointers, NULL them out before freeing the struct. Any remaining
   latent UAF on lcbvb_CONFIG (through cmdq.config, a captured
   lcbvb_CONFIG*, or a stale Server) now NULL-derefs deterministically
   at the offending field instead of reading whatever the next allocator
   hands out. Cheap insurance.

4) Atomic swap in replace_config(). Replaced the
   take_pipelines/build/add_pipelines pattern -- which left the cmdq
   with pipelines=NULL and npipelines=0 across an arbitrary number of
   synchronous Server constructor allocs -- with a single tight swap
   block at the end of replace_config(). cq->pipelines, cq->npipelines,
   cq->_npipelines_ex, cq->scheds, and cq->config are now updated as one
   sequence; the cmdq is never observably in a half-installed state.
   This is structural defense-in-depth.

5) Bounds guard in lcb_vbguess_remap(). PS2 still reproduced exactly one
   SIGSEGV in FoRecoverDelta-SUBDOC at the first deref of
   cfg->vbuckets[vbid] (build-1524), with cfg = LCBT_VBCONFIG(instance)
   pointing at a freed-and-poisoned lcbvb_CONFIG (vbuckets=NULL via (3)).
   The handle_nmv ref guard in (2) pins cur_configinfo but the deref is
   on cmdq.config; there is at least one SUBDOC code path where the two
   diverge or where cmdq.config briefly points at a stale vbc. Until
   that path is fully traced, gate the deref: reject NULL cfg, NULL
   cfg->vbuckets, or vbid out of [0, cfg->nvb). On reject, the op falls
   through lcb_kv_should_retry -> mcreq_renew_packet -> retryq->nmvadd,
   which is the same path as a legitimate "no remap currently
   available" and waits for the next config to arrive. Logged at WARN
   so the rejection is visible in production logs.

Tests
-----

tests/iotests/t_netfail.cc adds two cases. Both reproduce the failure
mode without iptables.

testRetryOnMissingNodeAfterMapRepair: stops the config monitor, sets
vbuckets[vb].servers[0] = -1 to force lcbvb_vbmaster() to return -1 (the
precise condition retryq.cc:264 trips on), schedules a 200 ms timer that
restores the master in place, and issues an lcb_get with a 2 s deadline.
On gerrit/master (MISSINGNODE = 0) retryq fails the op at the first
10 ms tick with LCB_ERR_NO_MATCHING_SERVER. With (1) applied, the op
stays in retryq across ticks; once the timer restores the map, the next
tick dispatches it on the correct pipeline and the GET succeeds.

testConfigReplaceMidRetry: same setup, but instead of restoring the live
vbc in place, snapshots the vbc to JSON via lcbvb_save_json(), loads it
into a brand-new lcbvb_CONFIG, wraps it in a ConfigInfo, and calls
lcb_update_vbconfig() from the timer. This exercises the full
replace_config() path: cur_configinfo is swapped, replace_config() moves
pipelines, the old ConfigInfo is decref'd, and the old lcbvb_CONFIG is
freed via lcbvb_destroy(). Without (4) the cmdq would have been visibly
inconsistent during the swap; without (2) and (3) any captured raw
lcbvb_CONFIG* across the call would UAF on the freed memory. With all
five fixes, the post-swap retryq tick dispatches on the new pipelines
and the GET succeeds.

The pre-existing testNegativeIndex still passes; the op now waits the
full 500 ms op_timeout before failing with NO_MATCHING_SERVER (preserved
as origerr) instead of failing fast.

Verified locally on libev/libevent IO plugins: full unit-tests suite
(194/194, modulo 1 environment-only Behavior.PluginDefaults skip) green.

Files touched: src/settings.{cc,h}, src/mcserver/mcserver.cc,
src/newconfig.cc, src/vbucket/vbucket.c, tests/iotests/t_netfail.cc.

Change-Id: I83c8c95a1b279082cd750d8363664f14a78aa0c5
Reviewed-on: https://review.couchbase.org/c/libcouchbase/+/244745
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: Sergey Avseyev <sergey.avseyev@gmail.com>
diff --git a/src/mcserver/mcserver.cc b/src/mcserver/mcserver.cc
@@ -144,6 +144,26 @@ bool Server::handle_nmv(MemcachedResponse &resinfo, mc_PACKET *oldpkt)
 
     MC_INCR_METRIC(this, packets_nmv, 1);
 
+    /* CCBC-1702: pin the current config for the duration of this call.
+     *
+     * Without this ref, lcb_vbguess_remap() and the NMV-driven cccp_update()
+     * below can race: cccp_update() (or any other config-replace path that
+     * runs in a nested event-handler stack frame) decrefs the old
+     * lcb_pCONFIGINFO and lcbvb_destroy() frees its lcbvb_CONFIG. A
+     * subsequent dereference of that lcbvb_CONFIG -- via cmdq.config or via
+     * a cached pointer -- is then a UAF. We have observed this as a SIGSEGV
+     * inside lcbvb_nmv_remap_ex during FoRecoverDelta scenarios when the
+     * retryq keeps ops in flight long enough to reach this handler.
+     *
+     * Holding a ref here keeps the at-entry config (and its vbc) alive
+     * until we return, which is sufficient: even if cur_configinfo is
+     * swapped to a new ConfigInfo during the call, the new one carries its
+     * own ref via lcb_update_vbconfig(), so cmdq.config remains valid. */
+    auto *info_at_entry = instance->cur_configinfo;
+    if (info_at_entry) {
+        info_at_entry->incref();
+    }
+
     mcreq_read_hdr(oldpkt, &hdr);
     vbid = ntohs(hdr.request.vbucket);
     lcb_log(LOGARGS_T(WARN), LOGFMT "NOT_MY_VBUCKET. Packet=%p (S=%u). VBID=%u, has_config=%s", LOGID_T(),
@@ -178,13 +198,19 @@ bool Server::handle_nmv(MemcachedResponse &resinfo, mc_PACKET *oldpkt)
     }
     lcb_RETRY_ACTION retry = lcb_kv_should_retry(settings, oldpkt, LCB_ERR_NOT_MY_VBUCKET);
     if (!retry.should_retry) {
+        if (info_at_entry) {
+            info_at_entry->decref();
+        }
         return false;
     }
 
     /** Reschedule the packet again .. */
     mc_PACKET *newpkt = mcreq_renew_packet(oldpkt);
     newpkt->flags &= ~MCREQ_STATE_FLAGS;
     instance->retryq->nmvadd((mc_EXPACKET *)newpkt);
+    if (info_at_entry) {
+        info_at_entry->decref();
+    }
     return true;
 }
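
Review note: the hunks above drop the pinned config by hand before each `return`. Purely as an illustration, and not part of this change, the same pin could be expressed as a scope guard so that a return path added later cannot skip the decref. `RefCounted`, `ScopedRef`, and `handle` are hypothetical stand-ins; only the incref()/decref() method names come from the diff, and this toy decref() does not free the object at zero the way a real refcounted config would.

```cpp
#include <cassert>

// Hypothetical stand-in for a refcounted config object. Only the
// incref()/decref() names mirror the diff; nothing is freed at zero here.
struct RefCounted {
    int refs = 1;
    void incref() { ++refs; }
    void decref()
    {
        assert(refs > 0);
        --refs;
    }
};

// Scope guard: takes a ref on construction (if the pointer is non-null)
// and drops it on destruction, so every return path releases the pin.
class ScopedRef
{
  public:
    explicit ScopedRef(RefCounted *obj) : obj_(obj)
    {
        if (obj_ != nullptr) {
            obj_->incref();
        }
    }
    ~ScopedRef()
    {
        if (obj_ != nullptr) {
            obj_->decref();
        }
    }
    ScopedRef(const ScopedRef &) = delete;
    ScopedRef &operator=(const ScopedRef &) = delete;

  private:
    RefCounted *obj_;
};

// Models a handler with one early return and one normal return.
bool handle(RefCounted *current, bool should_retry)
{
    ScopedRef pin(current); // pinned for the whole call
    if (!should_retry) {
        return false; // pin released here
    }
    return true; // and here
}
```

Whether a guard like this fits the codebase depends on the existing conventions around lcb_pCONFIGINFO; the manual form in the patch has the advantage of being explicit in a small diff.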
 
diff --git a/src/newconfig.cc b/src/newconfig.cc
@@ -99,8 +99,35 @@ int lcb_vbguess_remap(lcb_INSTANCE *instance, int vbid, int bad)
         return -1;
     }
 
+    /* CCBC-1702: defensive bounds check.
+     *
+     * Server::handle_nmv pins instance->cur_configinfo for the duration of
+     * the call, which is sufficient for the common case where cmdq.config
+     * == cur_configinfo->vbc. But empirically (build-1524 FoRecoverDelta-
+     * SUBDOC SIGSEGV at the deref of cfg->vbuckets[vbid]), there is still
+     * at least one path where cmdq.config points at a lcbvb_CONFIG that
+     * has been freed via lcbvb_destroy() while a SUBDOC response handler
+     * is mid-call. The defensive zeroing in lcbvb_destroy() leaves a
+     * stale cfg with vbuckets=NULL/nvb=0; without this guard, the
+     * subsequent deref is a NULL+offset SIGSEGV.
+     *
+     * Returning -1 here funnels the op into the same path as a legitimate
+     * "no remap currently available": lcb_kv_should_retry ->
+     * mcreq_renew_packet -> retryq->nmvadd, where it waits for the next
+     * config (which is in-flight at this exact moment, since the only
+     * path that frees the old vbc is lcb_update_vbconfig()) and retries
+     * against fresh pipelines. */
+    lcbvb_CONFIG *cfg = LCBT_VBCONFIG(instance);
+    if (cfg == nullptr || cfg->vbuckets == nullptr || vbid < 0 || (unsigned)vbid >= cfg->nvb) {
+        lcb_log(LOGARGS(instance, WARN),
+                "vbguess_remap: rejecting deref of stale or empty vbucket map "
+                "(cfg=%p, vbuckets=%p, nvb=%u, vbid=%d). Op will be requeued.",
+                (void *)cfg, cfg ? (void *)cfg->vbuckets : nullptr, cfg ? cfg->nvb : 0u, vbid);
+        return -1;
+    }
+
     if (LCBT_SETTING(instance, vb_noguess)) {
-        int newix = lcbvb_nmv_remap_ex(LCBT_VBCONFIG(instance), vbid, bad, 0);
+        int newix = lcbvb_nmv_remap_ex(cfg, vbid, bad, 0);
         if (newix > -1 && newix != bad) {
             lcb_log(LOGARGS(instance, TRACE), "Got new index from ffmap. VBID=%d. Old=%d. New=%d", vbid, bad, newix);
         }
@@ -109,11 +136,10 @@ int lcb_vbguess_remap(lcb_INSTANCE *instance, int vbid, int bad)
     } else {
         lcb_GUESSVB *guesses = instance->vbguess;
         if (!guesses) {
-            guesses = instance->vbguess =
-                reinterpret_cast<lcb_GUESSVB *>(calloc(LCBT_VBCONFIG(instance)->nvb, sizeof(lcb_GUESSVB)));
+            guesses = instance->vbguess = reinterpret_cast<lcb_GUESSVB *>(calloc(cfg->nvb, sizeof(lcb_GUESSVB)));
         }
         lcb_GUESSVB *guess = guesses + vbid;
-        int newix = lcbvb_nmv_remap_ex(LCBT_VBCONFIG(instance), vbid, bad, 1);
+        int newix = lcbvb_nmv_remap_ex(cfg, vbid, bad, 1);
         if (newix > -1 && newix != bad) {
             guess->newix = static_cast<char>(newix);
             guess->oldix = static_cast<char>(bad);
@@ -238,74 +264,119 @@ static int iterwipe_cb(mc_CMDQUEUE *cq, mc_PIPELINE *oldpl, mc_PACKET *oldpkt, v
     return MCREQ_REMOVE_PACKET;
 }
 
+/* CCBC-1702: structural cleanup of the replace path.
+ *
+ * The pre-fix flow used mcreq_queue_take_pipelines() to NULL out
+ * cq->pipelines and zero cq->npipelines, then built the new pipeline
+ * array, then mcreq_queue_add_pipelines() to install it. While LCB is
+ * single-threaded and the event loop cannot dispatch a retryq tick
+ * inside this call frame, leaving the cmdq in pipelines=NULL /
+ * npipelines=0 across an arbitrary number of synchronous heap allocs
+ * (the new lcb::Server constructors) is brittle: any future change
+ * that introduces a synchronous reader of cq state in a Server ctor or
+ * in find_new_data_index() would observe a transient inconsistent
+ * cmdq.
+ *
+ * This rewrite performs the swap as a single tight sequence at the
+ * end, after ppnew is fully built. Old slots that were not kept are
+ * tracked via a parallel bitmap (`moved[]`) instead of by writing NULL
+ * into cq->pipelines, so the live cq->pipelines buffer is not modified
+ * before the swap. The retry-policy fix in settings.cc is the
+ * load-bearing change for the visible behaviour; this is the
+ * belt-and-suspenders. */
 static void replace_config(lcb_INSTANCE *instance, lcbvb_CONFIG *oldconfig, lcbvb_CONFIG *newconfig)
 {
     mc_CMDQUEUE *cq = &instance->cmdq;
-    mc_PIPELINE **ppold, **ppnew;
-    unsigned ii, nold, nnew;
 
     lcb_assert(LCBT_VBCONFIG(instance) == newconfig);
 
-    nnew = LCBVB_NSERVERS(newconfig);
-    ppnew = reinterpret_cast<mc_PIPELINE **>(calloc(nnew, sizeof(*ppnew)));
-    ppold = mcreq_queue_take_pipelines(cq, &nold);
+    unsigned nnew = LCBVB_NSERVERS(newconfig);
+    mc_PIPELINE **ppnew = reinterpret_cast<mc_PIPELINE **>(calloc(nnew, sizeof(mc_PIPELINE *)));
+
+    /* Snapshot the existing pipelines without disturbing cq. */
+    mc_PIPELINE **old_pipelines_buf = cq->pipelines;
+    unsigned nold = cq->npipelines;
+    bool *moved = reinterpret_cast<bool *>(calloc(nold ? nold : 1, sizeof(bool)));
 
-    /**
-     * Determine which existing servers are still part of the new cluster config
-     * and place it inside the new list.
-     */
-    for (ii = 0; ii < nold; ii++) {
-        auto *cur = static_cast<lcb::Server *>(ppold[ii]);
+    /* Determine which existing servers are still part of the new cluster
+     * config and place them in the new list. */
+    for (unsigned ii = 0; ii < nold; ii++) {
+        auto *cur = static_cast<lcb::Server *>(old_pipelines_buf[ii]);
         int newix = find_new_data_index(oldconfig, newconfig, cur);
         if (newix > -1) {
             cur->set_new_index(newix);
             ppnew[newix] = cur;
-            ppold[ii] = nullptr;
+            moved[ii] = true;
             lcb_log(LOGARGS(instance, INFO), "Reusing server " SERVER_FMT ". OldIndex=%d. NewIndex=%d",
                     SERVER_ARGS(cur), ii, newix);
         }
     }
 
-    /**
-     * Once we've moved the kept servers to the new list, allocate new lcb::Server
-     * structures for slots that don't have an existing lcb::Server. We must do
-     * this before add_pipelines() is called, so that there are no holes inside
-     * ppnew
-     */
-    for (ii = 0; ii < nnew; ii++) {
+    /* Allocate new lcb::Server structures for slots that do not have one. */
+    for (unsigned ii = 0; ii < nnew; ii++) {
         if (!ppnew[ii]) {
             ppnew[ii] = new lcb::Server(instance, static_cast<int>(ii));
         }
     }
 
-    /**
-     * Once we have all the server structures in place for the new config,
-     * transfer the new config along with the new list over to the CQ structure.
-     */
-    mcreq_queue_add_pipelines(cq, ppnew, nnew, newconfig);
-
-    /**
-     * Go through all the servers that are to be removed and relocate commands
-     * from their queues into the new queues
-     */
-    for (ii = 0; ii < nold; ii++) {
-        if (!ppold[ii]) {
-            continue;
+    /* Swap of cq state. Everything is built first; the five assignments at
+     * the end of this block then install the new pipelines, npipelines,
+     * _npipelines_ex, scheds, and config back to back. */
+    {
+        size_t pl_bytes = sizeof(mc_PIPELINE *) * (nnew + 1);
+        mc_PIPELINE **new_queue_pipelines = reinterpret_cast<mc_PIPELINE **>(malloc(pl_bytes));
+        memcpy(new_queue_pipelines, ppnew, sizeof(mc_PIPELINE *) * nnew);
+
+        unsigned new_ex = nnew;
+        if (cq->fallback) {
+            cq->fallback->index = nnew;
+            new_queue_pipelines[nnew] = cq->fallback;
+            new_ex++;
         }
 
-        mcreq_iterwipe(cq, ppold[ii], iterwipe_cb, nullptr);
-        static_cast<lcb::Server *>(ppold[ii])->purge(LCB_ERR_MAP_CHANGED);
-        static_cast<lcb::Server *>(ppold[ii])->close();
+        for (unsigned ii = 0; ii < nnew; ii++) {
+            ppnew[ii]->parent = cq;
+            ppnew[ii]->index = ii;
+        }
+
+        char *new_scheds = reinterpret_cast<char *>(calloc(nnew + 1, sizeof(char)));
+        char *old_scheds = cq->scheds;
+
+        /* Note: we do NOT free cq->pipelines here. old_pipelines_buf
+         * aliases that buffer and the drain loop below still walks it
+         * via old_pipelines_buf[ii]. The free is deferred to the bottom
+         * of this function. */
+        cq->pipelines = new_queue_pipelines;
+        cq->npipelines = nnew;
+        cq->_npipelines_ex = new_ex;
+        cq->scheds = new_scheds;
+        cq->config = newconfig;
+
+        free(old_scheds);
+    }
+
+    /* Drain old pipelines that were not carried over: relocate their
+     * pending packets onto the new pipelines (mcreq_iterwipe ->
+     * iterwipe_cb), purge any that cannot be relocated, then close the
+     * pipeline. */
+    for (unsigned ii = 0; ii < nold; ii++) {
+        if (moved[ii]) {
+            continue;
+        }
+        mcreq_iterwipe(cq, old_pipelines_buf[ii], iterwipe_cb, nullptr);
+        static_cast<lcb::Server *>(old_pipelines_buf[ii])->purge(LCB_ERR_MAP_CHANGED);
+        static_cast<lcb::Server *>(old_pipelines_buf[ii])->close();
     }
 
-    for (ii = 0; ii < nnew; ii++) {
+    for (unsigned ii = 0; ii < nnew; ii++) {
         if (static_cast<lcb::Server *>(ppnew[ii])->has_pending()) {
             ppnew[ii]->flush_start(ppnew[ii]);
         }
     }
 
+    free(moved);
     free(ppnew);
-    free(ppold);
+    free(old_pipelines_buf);
 }
 
 void lcb_update_vbconfig(lcb_INSTANCE *instance, lcb_pCONFIGINFO config)
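
Review note: the build-then-swap shape of the rewritten replace_config() can be shown in isolation. The sketch below is a simplified model, not libcouchbase code: `Pipe`, `Queue`, and `replace_pipes` are hypothetical names, and pipes are matched by host string instead of by find_new_data_index(). It keeps the two properties the comment above describes: the live array is only read until the final swap, and kept slots are tracked in a parallel `moved` bitmap.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a pipeline/server object.
struct Pipe {
    std::string host;
    int index;
    Pipe(std::string h, int ix) : host(std::move(h)), index(ix) {}
};

// Hypothetical stand-in for the command queue; `pipes` is the live array.
struct Queue {
    std::vector<Pipe *> pipes;
};

// Builds the new array on the side, installs it with one swap, and returns
// the old pipes that were not carried over so the caller can drain and
// delete them. q.pipes is never left empty or partially filled.
std::vector<Pipe *> replace_pipes(Queue &q, const std::vector<std::string> &new_hosts)
{
    std::vector<Pipe *> fresh(new_hosts.size(), nullptr);
    std::vector<bool> moved(q.pipes.size(), false); // parallel "kept" bitmap

    // Reuse pipes whose host is still present; the live array is only read.
    for (std::size_t i = 0; i < q.pipes.size(); ++i) {
        for (std::size_t j = 0; j < new_hosts.size(); ++j) {
            if (fresh[j] == nullptr && q.pipes[i]->host == new_hosts[j]) {
                fresh[j] = q.pipes[i];
                moved[i] = true;
                break;
            }
        }
    }

    // Fill the remaining slots with new pipes and fix up every index.
    for (std::size_t j = 0; j < fresh.size(); ++j) {
        if (fresh[j] == nullptr) {
            fresh[j] = new Pipe(new_hosts[j], static_cast<int>(j));
        }
        fresh[j]->index = static_cast<int>(j);
    }

    // Install. After this line `fresh` holds the old array.
    q.pipes.swap(fresh);

    std::vector<Pipe *> retired;
    for (std::size_t i = 0; i < fresh.size(); ++i) {
        if (!moved[i]) {
            retired.push_back(fresh[i]);
        }
    }
    assert(q.pipes.size() == new_hosts.size());
    return retired;
}
```

Replacing hosts {a, b} with {b, c} in this model reuses the `b` pipe at index 0, allocates a new `c` pipe at index 1, and hands `a` back as retired.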
diff --git a/src/settings.cc b/src/settings.cc
@@ -46,7 +46,7 @@ void lcb_default_settings(lcb_settings *settings)
     settings->retry[LCB_RETRY_ON_SOCKERR] = LCB_DEFAULT_NETRETRY;
     settings->retry[LCB_RETRY_ON_TOPOCHANGE] = LCB_DEFAULT_TOPORETRY;
     settings->retry[LCB_RETRY_ON_VBMAPERR] = LCB_DEFAULT_NMVRETRY;
-    settings->retry[LCB_RETRY_ON_MISSINGNODE] = 0;
+    settings->retry[LCB_RETRY_ON_MISSINGNODE] = LCB_DEFAULT_MISSINGNODERETRY;
     settings->bc_http_urltype = LCB_DEFAULT_HTCONFIG_URLTYPE;
     settings->compressopts = LCB_DEFAULT_COMPRESSOPTS;
     settings->compress_min_size = LCB_DEFAULT_COMPRESS_MIN_SIZE;
diff --git a/src/settings.h b/src/settings.h
@@ -80,6 +80,15 @@
 #define LCB_DEFAULT_TOPORETRY LCB_RETRY_CMDS_ALL
 #define LCB_DEFAULT_NETRETRY LCB_RETRY_CMDS_ALL
 #define LCB_DEFAULT_NMVRETRY LCB_RETRY_CMDS_ALL
+/* Retry an op against the retry queue when the vbucket map briefly has no
+ * master mapped (srvix < 0 || srvix >= cq->npipelines). The default changed
+ * from 0 to LCB_RETRY_CMDS_ALL in CCBC-1702: while the vbmap is in
+ * transition (e.g. right after a failover, or around a config replace in
+ * replace_config()), a retryq tick would otherwise fail the op with
+ * LCB_ERR_NO_MATCHING_SERVER even though a healthy map is about to be
+ * installed. The op deadline still bounds how long we keep
+ * retrying. */
+#define LCB_DEFAULT_MISSINGNODERETRY LCB_RETRY_CMDS_ALL
 #define LCB_DEFAULT_HTCONFIG_URLTYPE LCB_HTCONFIG_URLTYPE_TRYALL
 #define LCB_DEFAULT_COMPRESSOPTS LCB_COMPRESS_INOUT
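
Review note: the effect of the new default can be modelled with a simulated clock. This is a toy model under stated assumptions, not the retryq implementation: `dispatch_with_retry`, `Status`, and `Outcome` are hypothetical names, the master lookup is a plain callback, and time advances in fixed ticks. The 10 ms tick, 200 ms repair, and 2 s / 500 ms deadlines are the figures quoted in the commit message's test description.

```cpp
#include <cassert>
#include <functional>

enum class Status { Success, NoMatchingServer };

struct Outcome {
    Status status;
    int finished_at_ms;
};

// master_at(t) returns the master index for the op's vbucket at simulated
// time t (in ms), or -1 while the map has no master for it.
Outcome dispatch_with_retry(const std::function<int(int)> &master_at, int npipelines, bool retry_on_missing,
                            int deadline_ms, int tick_ms)
{
    assert(tick_ms > 0 && deadline_ms >= tick_ms);
    for (int now = tick_ms;; now += tick_ms) {
        const int ix = master_at(now);
        if (ix >= 0 && ix < npipelines) {
            return {Status::Success, now}; // dispatched on a valid pipeline
        }
        // Old default: fail on the first tick. New default: keep the op
        // queued until no further tick fits inside the deadline.
        if (!retry_on_missing || now + tick_ms > deadline_ms) {
            return {Status::NoMatchingServer, now};
        }
    }
}
```

With a map repaired at 200 ms, the model fails at the first tick when retry is off and succeeds at 200 ms when it is on; with a map that never recovers it still fails with the same error, only at the deadline.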
 
diff --git a/src/vbucket/vbucket.c b/src/vbucket/vbucket.c
@@ -900,6 +900,23 @@ void lcbvb_destroy(lcbvb_CONFIG *conf)
     free(conf->vbuckets);
     free(conf->ffvbuckets);
     free(conf->randbuf);
+    /* CCBC-1702: poison freed pointers so that any latent UAF on this
+     * struct (e.g. through cmdq.config or a captured lcbvb_CONFIG*) faults
+     * deterministically at the offending field rather than reading garbage
+     * from a recycled allocation. The struct itself is freed below; if
+     * anyone deref'd the struct after this returns, AddressSanitizer would
+     * have caught it -- but in production builds, NULL-deref is a far more
+     * actionable signal than reading whatever the next allocator hands out. */
+    conf->servers = NULL;
+    conf->continuum = NULL;
+    conf->buuid = NULL;
+    conf->bname = NULL;
+    conf->vbuckets = NULL;
+    conf->ffvbuckets = NULL;
+    conf->randbuf = NULL;
+    conf->nsrv = 0;
+    conf->ndatasrv = 0;
+    conf->nvb = 0;
     free(conf);
 }
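
Review note: the poisoning here and the bounds guard added to lcb_vbguess_remap() work as a pair, which the reduced model below tries to show. `Config`, `make_config`, `release_contents`, and `can_index` are hypothetical names. Unlike lcbvb_destroy(), the model deliberately keeps the struct alive after poisoning so the state can be inspected safely; in the real code the struct is freed right afterwards, so a later read is still undefined behaviour, and an optimizing compiler is permitted to drop stores to memory that is freed immediately, which may be worth checking in release builds.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical, heavily reduced stand-in for a vbucket config.
struct Config {
    int *vbuckets; // one entry per vbucket (master index)
    unsigned nvb;  // number of vbuckets
};

Config *make_config(unsigned nvb)
{
    Config *cfg = static_cast<Config *>(std::malloc(sizeof(Config)));
    assert(cfg != nullptr);
    cfg->vbuckets = static_cast<int *>(std::calloc(nvb, sizeof(int)));
    assert(cfg->vbuckets != nullptr);
    cfg->nvb = nvb;
    return cfg;
}

// Frees the contained buffer and poisons the fields. The struct itself is
// kept alive in this model; the caller frees it separately.
void release_contents(Config *cfg)
{
    std::free(cfg->vbuckets);
    cfg->vbuckets = nullptr;
    cfg->nvb = 0;
}

// Same shape as the guard condition in the newconfig.cc hunk: reject a null
// config, a null map, or a vbucket id outside [0, nvb).
bool can_index(const Config *cfg, int vbid)
{
    return cfg != nullptr && cfg->vbuckets != nullptr && vbid >= 0 && static_cast<unsigned>(vbid) < cfg->nvb;
}
```

In this model a poisoned config is rejected for every vbid, which is the state the guard is written to catch.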
 
diff --git a/tests/iotests/t_netfail.cc b/tests/iotests/t_netfail.cc