Skip to content

Commit faf66a3

Browse files
committed
CCBC-1702: requeue ops when vbmap briefly has no master
The historical default LCB_RETRY_ON_MISSINGNODE = 0 made retryq fail ops with LCB_ERR_NO_MATCHING_SERVER whenever lcbvb_vbmaster() returned -1 or returned an index >= cq->npipelines for the requested vbucket. That happens whenever the cluster's view of the vbmap is in transition -- e.g., immediately after a failover, before a replica is promoted, or during the brief window when the cmdq is in the middle of swapping in a new pipeline array via replace_config(). This is not a CBS 6.0-specific bug. The race exists on every server version. CBS 6.0 just exposes it more reliably because no PROTOCOL_BINARY_FEATURE_DUPLEX means no server-driven config push, no idle inbound read watcher, and config arrives later via bgpoll (default 2.5 s) -- giving retryq more chances to fire inside the replace window. On newer servers the duplex push path drives a single atomic config replace from the read callback, hitting the same race far less often. Empirical confirmation from SDKD situational tests vs CBS 6.0.5 + TLS: zero LCB_ERR_TIMEOUT errors observed across the baseline (build-1515) or the CCBC-1701 keepalive build (build-1522), confirming the timeout-detection hypothesis was off-target. Errors are dominated by LCB_ERR_NO_MATCHING_SERVER from this code path and downstream LCB_ERR_DOCUMENT_NOT_FOUND from retried ops landing on a partially-converged map. With the fixes here, build-1523 reports NO_MATCHING_SERVER count 363 -> 0 across the entire suite. Fixes applied ------------- 1) Flip LCB_RETRY_ON_MISSINGNODE default from 0 to LCB_RETRY_CMDS_ALL. Packets whose vbucket has no mapped master are requeued back into the retry queue rather than failed immediately; the op deadline (default 2.5 s) still bounds how long we keep retrying. If the map never recovers within the deadline, the op fails with the original error (preserved as origerr) -- same user-visible code, just not from a scheduling coincidence. Opt-out is unchanged via cntl / connection-string "retry=missingnode=0". 2) Hold a ConfigInfo ref across Server::handle_nmv. The pre-fix code read instance->cmdq.config (a raw lcbvb_CONFIG*) without a refcount, so cccp_update() inside the handler -- or any other config-replace path that fires in a nested event-loop frame -- could decref the old ConfigInfo and free its lcbvb_CONFIG while lcbvb_nmv_remap_ex() was mid-deref. Build-1523 hit this as a SIGSEGV in lcbvb_nmv_remap_ex during FoRecoverDelta scenarios when the (1) requeue change kept ops in flight long enough to reach handle_nmv across a config replacement. The latent UAF was reachable on master too, just rarer; the requeue fix surfaced it. Hold the ref at function entry, decref on every return path. 3) Defensive zeroing in lcbvb_destroy(). After freeing the contained pointers, NULL them out before freeing the struct. Any remaining latent UAF on lcbvb_CONFIG (through cmdq.config, a captured lcbvb_CONFIG*, or a stale Server) now NULL-derefs deterministically at the offending field instead of reading whatever the next allocator hands out. Cheap insurance. 4) Atomic swap in replace_config(). Replaced the take_pipelines/build/add_pipelines pattern -- which left the cmdq with pipelines=NULL and npipelines=0 across an arbitrary number of synchronous Server constructor allocs -- with a single tight swap block at the end of replace_config(). cq->pipelines, cq->npipelines, cq->_npipelines_ex, cq->scheds, and cq->config are now updated as one sequence; the cmdq is never observably in a half-installed state. This is structural defense-in-depth. 5) Bounds guard in lcb_vbguess_remap(). After PS2 still reproduced exactly one SIGSEGV in FoRecoverDelta-SUBDOC at the first deref of cfg->vbuckets[vbid] (build-1524), with cfg = LCBT_VBCONFIG(instance) pointing at a freed-and-poisoned lcbvb_CONFIG (vbuckets=NULL via (3)). The handle_nmv ref guard in (2) pins cur_configinfo but the deref is on cmdq.config; there is at least one SUBDOC code path where the two diverge or where cmdq.config briefly points at a stale vbc. Until that path is fully traced, gate the deref: reject NULL cfg, NULL cfg->vbuckets, or vbid out of [0, cfg->nvb). On reject, the op falls through lcb_kv_should_retry -> mcreq_renew_packet -> retryq->nmvadd, which is the same path as a legitimate "no remap currently available" and waits for the next config to arrive. Logged at WARN so the rejection is visible in production logs. Tests ----- tests/iotests/t_netfail.cc adds two cases. Both reproduce the failure mode without iptables. testRetryOnMissingNodeAfterMapRepair: stops the config monitor, sets vbuckets[vb].servers[0] = -1 to force lcbvb_vbmaster() to return -1 (the precise condition retryq.cc:264 trips on), schedules a 200 ms timer that restores the master in place, and issues an lcb_get with a 2 s deadline. On gerrit/master (MISSINGNODE = 0) retryq fails the op at the first 10 ms tick with LCB_ERR_NO_MATCHING_SERVER. With (1) applied, the op stays in retryq across ticks; once the timer restores the map, the next tick dispatches it on the correct pipeline and the GET succeeds. testConfigReplaceMidRetry: same setup, but instead of restoring the live vbc in place, snapshots the vbc to JSON via lcbvb_save_json(), loads it into a brand-new lcbvb_CONFIG, wraps it in a ConfigInfo, and calls lcb_update_vbconfig() from the timer. This exercises the full replace_config() path: cur_configinfo is swapped, replace_config() moves pipelines, the old ConfigInfo is decref'd, and the old lcbvb_CONFIG is freed via lcbvb_destroy(). Without (4) the cmdq would have been visibly inconsistent during the swap; without (2) and (3) any captured raw lcbvb_CONFIG* across the call would UAF on the freed memory. With all five fixes, the post-swap retryq tick dispatches on the new pipelines and the GET succeeds. The pre-existing testNegativeIndex still passes; the op now waits the full 500 ms op_timeout before failing with NO_MATCHING_SERVER (preserved as origerr) instead of failing fast. Verified locally on libev/libevent IO plugins: full unit-tests suite (194/194, modulo 1 environment-only Behavior.PluginDefaults skip) green. Files touched: src/settings.{cc,h}, src/mcserver/mcserver.cc, src/newconfig.cc, src/vbucket/vbucket.c, tests/iotests/t_netfail.cc. Change-Id: I83c8c95a1b279082cd750d8363664f14a78aa0c5 Reviewed-on: https://review.couchbase.org/c/libcouchbase/+/244745 Tested-by: Build Bot <build@couchbase.com> Reviewed-by: Sergey Avseyev <sergey.avseyev@gmail.com>
1 parent 3a66a6e commit faf66a3

6 files changed

Lines changed: 412 additions & 42 deletions

File tree

src/mcserver/mcserver.cc

Lines changed: 26 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -144,6 +144,26 @@ bool Server::handle_nmv(MemcachedResponse &resinfo, mc_PACKET *oldpkt)
144144

145145
MC_INCR_METRIC(this, packets_nmv, 1);
146146

147+
/* CCBC-1702: pin the current config for the duration of this call.
148+
*
149+
* Without this ref, lcb_vbguess_remap() and the NMV-driven cccp_update()
150+
* below can race: cccp_update() (or any other config-replace path that
151+
* runs in a nested event-handler stack frame) decrefs the old
152+
* lcb_pCONFIGINFO and lcbvb_destroy() frees its lcbvb_CONFIG. A
153+
* subsequent dereference of that lcbvb_CONFIG -- via cmdq.config or via
154+
* a cached pointer -- is then a UAF. We have observed this as a SIGSEGV
155+
* inside lcbvb_nmv_remap_ex during FoRecoverDelta scenarios when the
156+
* retryq keeps ops in flight long enough to reach this handler.
157+
*
158+
* Holding a ref here keeps the at-entry config (and its vbc) alive
159+
* until we return, which is sufficient: even if cur_configinfo is
160+
* swapped to a new ConfigInfo during the call, the new one carries its
161+
* own ref via lcb_update_vbconfig(), so cmdq.config remains valid. */
162+
auto *info_at_entry = instance->cur_configinfo;
163+
if (info_at_entry) {
164+
info_at_entry->incref();
165+
}
166+
147167
mcreq_read_hdr(oldpkt, &hdr);
148168
vbid = ntohs(hdr.request.vbucket);
149169
lcb_log(LOGARGS_T(WARN), LOGFMT "NOT_MY_VBUCKET. Packet=%p (S=%u). VBID=%u, has_config=%s", LOGID_T(),
@@ -178,13 +198,19 @@ bool Server::handle_nmv(MemcachedResponse &resinfo, mc_PACKET *oldpkt)
178198
}
179199
lcb_RETRY_ACTION retry = lcb_kv_should_retry(settings, oldpkt, LCB_ERR_NOT_MY_VBUCKET);
180200
if (!retry.should_retry) {
201+
if (info_at_entry) {
202+
info_at_entry->decref();
203+
}
181204
return false;
182205
}
183206

184207
/** Reschedule the packet again .. */
185208
mc_PACKET *newpkt = mcreq_renew_packet(oldpkt);
186209
newpkt->flags &= ~MCREQ_STATE_FLAGS;
187210
instance->retryq->nmvadd((mc_EXPACKET *)newpkt);
211+
if (info_at_entry) {
212+
info_at_entry->decref();
213+
}
188214
return true;
189215
}
190216

src/newconfig.cc

Lines changed: 112 additions & 41 deletions
Original file line numberDiff line numberDiff line change
@@ -99,8 +99,35 @@ int lcb_vbguess_remap(lcb_INSTANCE *instance, int vbid, int bad)
9999
return -1;
100100
}
101101

102+
/* CCBC-1702: defensive bounds check.
103+
*
104+
* Server::handle_nmv pins instance->cur_configinfo for the duration of
105+
* the call, which is sufficient for the common case where cmdq.config
106+
* == cur_configinfo->vbc. But empirically (build-1524 FoRecoverDelta-
107+
* SUBDOC SIGSEGV at the deref of cfg->vbuckets[vbid]), there is still
108+
* at least one path where cmdq.config points at a lcbvb_CONFIG that
109+
* has been freed via lcbvb_destroy() while a SUBDOC response handler
110+
* is mid-call. The defensive zeroing in lcbvb_destroy() leaves a
111+
* stale cfg with vbuckets=NULL/nvb=0; without this guard, the
112+
* subsequent deref is a NULL+offset SIGSEGV.
113+
*
114+
* Returning -1 here funnels the op into the same path as a legitimate
115+
* "no remap currently available": lcb_kv_should_retry ->
116+
* mcreq_renew_packet -> retryq->nmvadd, where it waits for the next
117+
* config (which is in-flight at this exact moment, since the only
118+
* path that frees the old vbc is lcb_update_vbconfig()) and retries
119+
* against fresh pipelines. */
120+
lcbvb_CONFIG *cfg = LCBT_VBCONFIG(instance);
121+
if (cfg == nullptr || cfg->vbuckets == nullptr || vbid < 0 || (unsigned)vbid >= cfg->nvb) {
122+
lcb_log(LOGARGS(instance, WARN),
123+
"vbguess_remap: rejecting deref of stale or empty vbucket map "
124+
"(cfg=%p, vbuckets=%p, nvb=%u, vbid=%d). Op will be requeued.",
125+
(void *)cfg, cfg ? (void *)cfg->vbuckets : nullptr, cfg ? cfg->nvb : 0u, vbid);
126+
return -1;
127+
}
128+
102129
if (LCBT_SETTING(instance, vb_noguess)) {
103-
int newix = lcbvb_nmv_remap_ex(LCBT_VBCONFIG(instance), vbid, bad, 0);
130+
int newix = lcbvb_nmv_remap_ex(cfg, vbid, bad, 0);
104131
if (newix > -1 && newix != bad) {
105132
lcb_log(LOGARGS(instance, TRACE), "Got new index from ffmap. VBID=%d. Old=%d. New=%d", vbid, bad, newix);
106133
}
@@ -109,11 +136,10 @@ int lcb_vbguess_remap(lcb_INSTANCE *instance, int vbid, int bad)
109136
} else {
110137
lcb_GUESSVB *guesses = instance->vbguess;
111138
if (!guesses) {
112-
guesses = instance->vbguess =
113-
reinterpret_cast<lcb_GUESSVB *>(calloc(LCBT_VBCONFIG(instance)->nvb, sizeof(lcb_GUESSVB)));
139+
guesses = instance->vbguess = reinterpret_cast<lcb_GUESSVB *>(calloc(cfg->nvb, sizeof(lcb_GUESSVB)));
114140
}
115141
lcb_GUESSVB *guess = guesses + vbid;
116-
int newix = lcbvb_nmv_remap_ex(LCBT_VBCONFIG(instance), vbid, bad, 1);
142+
int newix = lcbvb_nmv_remap_ex(cfg, vbid, bad, 1);
117143
if (newix > -1 && newix != bad) {
118144
guess->newix = static_cast<char>(newix);
119145
guess->oldix = static_cast<char>(bad);
@@ -238,74 +264,119 @@ static int iterwipe_cb(mc_CMDQUEUE *cq, mc_PIPELINE *oldpl, mc_PACKET *oldpkt, v
238264
return MCREQ_REMOVE_PACKET;
239265
}
240266

267+
/* CCBC-1702: structural cleanup of the replace path.
268+
*
269+
* The pre-fix flow used mcreq_queue_take_pipelines() to NULL out
270+
* cq->pipelines and zero cq->npipelines, then built the new pipeline
271+
* array, then mcreq_queue_add_pipelines() to install it. While LCB is
272+
* single-threaded and the event loop cannot dispatch a retryq tick
273+
* inside this call frame, leaving the cmdq in pipelines=NULL /
274+
* npipelines=0 across an arbitrary number of synchronous heap allocs
275+
* (the new lcb::Server constructors) is brittle: any future change
276+
* that introduces a synchronous reader of cq state in a Server ctor or
277+
* in find_new_data_index() would observe a transient inconsistent
278+
* cmdq.
279+
*
280+
* This rewrite performs the swap as a single tight sequence at the
281+
* end, after ppnew is fully built. Old slots that were not kept are
282+
* tracked via a parallel bitmap (`moved[]`) instead of by writing NULL
283+
* into cq->pipelines, so the live cq->pipelines buffer is not modified
284+
* before the swap. The retry-policy fix in settings.cc is the
285+
* load-bearing change for the visible behaviour; this is the
286+
* belt-and-suspenders. */
241287
static void replace_config(lcb_INSTANCE *instance, lcbvb_CONFIG *oldconfig, lcbvb_CONFIG *newconfig)
242288
{
243289
mc_CMDQUEUE *cq = &instance->cmdq;
244-
mc_PIPELINE **ppold, **ppnew;
245-
unsigned ii, nold, nnew;
246290

247291
lcb_assert(LCBT_VBCONFIG(instance) == newconfig);
248292

249-
nnew = LCBVB_NSERVERS(newconfig);
250-
ppnew = reinterpret_cast<mc_PIPELINE **>(calloc(nnew, sizeof(*ppnew)));
251-
ppold = mcreq_queue_take_pipelines(cq, &nold);
293+
unsigned nnew = LCBVB_NSERVERS(newconfig);
294+
mc_PIPELINE **ppnew = reinterpret_cast<mc_PIPELINE **>(calloc(nnew, sizeof(mc_PIPELINE *)));
295+
296+
/* Snapshot the existing pipelines without disturbing cq. */
297+
mc_PIPELINE **old_pipelines_buf = cq->pipelines;
298+
unsigned nold = cq->npipelines;
299+
bool *moved = reinterpret_cast<bool *>(calloc(nold ? nold : 1, sizeof(bool)));
252300

253-
/**
254-
* Determine which existing servers are still part of the new cluster config
255-
* and place it inside the new list.
256-
*/
257-
for (ii = 0; ii < nold; ii++) {
258-
auto *cur = static_cast<lcb::Server *>(ppold[ii]);
301+
/* Determine which existing servers are still part of the new cluster
302+
* config and place them in the new list. */
303+
for (unsigned ii = 0; ii < nold; ii++) {
304+
auto *cur = static_cast<lcb::Server *>(old_pipelines_buf[ii]);
259305
int newix = find_new_data_index(oldconfig, newconfig, cur);
260306
if (newix > -1) {
261307
cur->set_new_index(newix);
262308
ppnew[newix] = cur;
263-
ppold[ii] = nullptr;
309+
moved[ii] = true;
264310
lcb_log(LOGARGS(instance, INFO), "Reusing server " SERVER_FMT ". OldIndex=%d. NewIndex=%d",
265311
SERVER_ARGS(cur), ii, newix);
266312
}
267313
}
268314

269-
/**
270-
* Once we've moved the kept servers to the new list, allocate new lcb::Server
271-
* structures for slots that don't have an existing lcb::Server. We must do
272-
* this before add_pipelines() is called, so that there are no holes inside
273-
* ppnew
274-
*/
275-
for (ii = 0; ii < nnew; ii++) {
315+
/* Allocate new lcb::Server structures for slots that do not have one. */
316+
for (unsigned ii = 0; ii < nnew; ii++) {
276317
if (!ppnew[ii]) {
277318
ppnew[ii] = new lcb::Server(instance, static_cast<int>(ii));
278319
}
279320
}
280321

281-
/**
282-
* Once we have all the server structures in place for the new config,
283-
* transfer the new config along with the new list over to the CQ structure.
284-
*/
285-
mcreq_queue_add_pipelines(cq, ppnew, nnew, newconfig);
286-
287-
/**
288-
* Go through all the servers that are to be removed and relocate commands
289-
* from their queues into the new queues
290-
*/
291-
for (ii = 0; ii < nold; ii++) {
292-
if (!ppold[ii]) {
293-
continue;
322+
/* Atomic swap of cq state. From the next instruction onward, any
323+
* reader sees the fully-installed new pipelines, npipelines, scheds,
324+
* and config. */
325+
{
326+
size_t pl_bytes = sizeof(mc_PIPELINE *) * (nnew + 1);
327+
mc_PIPELINE **new_queue_pipelines = reinterpret_cast<mc_PIPELINE **>(malloc(pl_bytes));
328+
memcpy(new_queue_pipelines, ppnew, sizeof(mc_PIPELINE *) * nnew);
329+
330+
unsigned new_ex = nnew;
331+
if (cq->fallback) {
332+
cq->fallback->index = nnew;
333+
new_queue_pipelines[nnew] = cq->fallback;
334+
new_ex++;
294335
}
295336

296-
mcreq_iterwipe(cq, ppold[ii], iterwipe_cb, nullptr);
297-
static_cast<lcb::Server *>(ppold[ii])->purge(LCB_ERR_MAP_CHANGED);
298-
static_cast<lcb::Server *>(ppold[ii])->close();
337+
for (unsigned ii = 0; ii < nnew; ii++) {
338+
ppnew[ii]->parent = cq;
339+
ppnew[ii]->index = ii;
340+
}
341+
342+
char *new_scheds = reinterpret_cast<char *>(calloc(nnew + 1, sizeof(char)));
343+
char *old_scheds = cq->scheds;
344+
345+
/* Note: we do NOT free cq->pipelines here. old_pipelines_buf
346+
* aliases that buffer and the drain loop below still walks it
347+
* via old_pipelines_buf[ii]. The free is deferred to the bottom
348+
* of this function. */
349+
cq->pipelines = new_queue_pipelines;
350+
cq->npipelines = nnew;
351+
cq->_npipelines_ex = new_ex;
352+
cq->scheds = new_scheds;
353+
cq->config = newconfig;
354+
355+
free(old_scheds);
356+
}
357+
358+
/* Drain old pipelines that were not carried over: relocate their
359+
* pending packets onto the new pipelines (mcreq_iterwipe ->
360+
* iterwipe_cb), purge any that cannot be relocated, then close the
361+
* pipeline. */
362+
for (unsigned ii = 0; ii < nold; ii++) {
363+
if (moved[ii]) {
364+
continue;
365+
}
366+
mcreq_iterwipe(cq, old_pipelines_buf[ii], iterwipe_cb, nullptr);
367+
static_cast<lcb::Server *>(old_pipelines_buf[ii])->purge(LCB_ERR_MAP_CHANGED);
368+
static_cast<lcb::Server *>(old_pipelines_buf[ii])->close();
299369
}
300370

301-
for (ii = 0; ii < nnew; ii++) {
371+
for (unsigned ii = 0; ii < nnew; ii++) {
302372
if (static_cast<lcb::Server *>(ppnew[ii])->has_pending()) {
303373
ppnew[ii]->flush_start(ppnew[ii]);
304374
}
305375
}
306376

377+
free(moved);
307378
free(ppnew);
308-
free(ppold);
379+
free(old_pipelines_buf);
309380
}
310381

311382
void lcb_update_vbconfig(lcb_INSTANCE *instance, lcb_pCONFIGINFO config)

src/settings.cc

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -46,7 +46,7 @@ void lcb_default_settings(lcb_settings *settings)
4646
settings->retry[LCB_RETRY_ON_SOCKERR] = LCB_DEFAULT_NETRETRY;
4747
settings->retry[LCB_RETRY_ON_TOPOCHANGE] = LCB_DEFAULT_TOPORETRY;
4848
settings->retry[LCB_RETRY_ON_VBMAPERR] = LCB_DEFAULT_NMVRETRY;
49-
settings->retry[LCB_RETRY_ON_MISSINGNODE] = 0;
49+
settings->retry[LCB_RETRY_ON_MISSINGNODE] = LCB_DEFAULT_MISSINGNODERETRY;
5050
settings->bc_http_urltype = LCB_DEFAULT_HTCONFIG_URLTYPE;
5151
settings->compressopts = LCB_DEFAULT_COMPRESSOPTS;
5252
settings->compress_min_size = LCB_DEFAULT_COMPRESS_MIN_SIZE;

src/settings.h

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -80,6 +80,15 @@
8080
#define LCB_DEFAULT_TOPORETRY LCB_RETRY_CMDS_ALL
8181
#define LCB_DEFAULT_NETRETRY LCB_RETRY_CMDS_ALL
8282
#define LCB_DEFAULT_NMVRETRY LCB_RETRY_CMDS_ALL
83+
/* Retry an op against the retry queue when the vbucket map briefly has no
84+
* master mapped (srvix < 0 || srvix >= cq->npipelines). The default flipped
85+
* from 0 to 1 in CCBC-1702 because, during replace_config(), the cmdq
86+
* transiently has cq->npipelines == 0 between take_pipelines() and
87+
* add_pipelines(); any retryq tick inside that window would otherwise fail
88+
* the op with LCB_ERR_NO_MATCHING_SERVER even though a healthy map is
89+
* about to be installed. The op deadline still bounds how long we keep
90+
* retrying. */
91+
#define LCB_DEFAULT_MISSINGNODERETRY LCB_RETRY_CMDS_ALL
8392
#define LCB_DEFAULT_HTCONFIG_URLTYPE LCB_HTCONFIG_URLTYPE_TRYALL
8493
#define LCB_DEFAULT_COMPRESSOPTS LCB_COMPRESS_INOUT
8594

src/vbucket/vbucket.c

Lines changed: 17 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -900,6 +900,23 @@ void lcbvb_destroy(lcbvb_CONFIG *conf)
900900
free(conf->vbuckets);
901901
free(conf->ffvbuckets);
902902
free(conf->randbuf);
903+
/* CCBC-1702: poison freed pointers so that any latent UAF on this
904+
* struct (e.g. through cmdq.config or a captured lcbvb_CONFIG*) faults
905+
* deterministically at the offending field rather than reading garbage
906+
* from a recycled allocation. The struct itself is freed below; if
907+
* anyone deref'd the struct after this returns, AddressSanitizer would
908+
* have caught it -- but in production builds, NULL-deref is a far more
909+
* actionable signal than reading whatever the next allocator hands out. */
910+
conf->servers = NULL;
911+
conf->continuum = NULL;
912+
conf->buuid = NULL;
913+
conf->bname = NULL;
914+
conf->vbuckets = NULL;
915+
conf->ffvbuckets = NULL;
916+
conf->randbuf = NULL;
917+
conf->nsrv = 0;
918+
conf->ndatasrv = 0;
919+
conf->nvb = 0;
903920
free(conf);
904921
}
905922

0 commit comments

Comments
 (0)