CCBC-1702: requeue ops when vbmap briefly has no master
The historical default LCB_RETRY_ON_MISSINGNODE = 0 made retryq fail
ops with LCB_ERR_NO_MATCHING_SERVER whenever lcbvb_vbmaster() returned
-1 or returned an index >= cq->npipelines for the requested vbucket.
That happens whenever the cluster's view of the vbmap is in transition
-- e.g., immediately after a failover, before a replica is promoted,
or during the brief window when the cmdq is in the middle of swapping
in a new pipeline array via replace_config().
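
The condition above can be sketched in isolation. This is a minimal
illustration, not the retryq code: VbMapSketch, Route and route_for
are hypothetical names, and the vector of master indexes stands in
for what lcbvb_vbmaster() would return for each vbucket.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the vbucket map; not a libcouchbase type.
struct VbMapSketch {
    // masters[vb] is the server index of the master for vbucket vb,
    // or -1 when the map currently names no master.
    std::vector<int> masters;
};

enum class Route { Dispatch, NoMaster };

// Dispatch only when the map names a master AND that index has a
// pipeline. Both "no master" cases from the text collapse to NoMaster.
Route route_for(const VbMapSketch &map, std::size_t vbid, std::size_t npipelines)
{
    if (vbid >= map.masters.size()) {
        return Route::NoMaster; // vbucket not covered by this map
    }
    const int ix = map.masters[vbid];
    if (ix < 0) {
        return Route::NoMaster; // analogous to lcbvb_vbmaster() == -1
    }
    if (static_cast<std::size_t>(ix) >= npipelines) {
        return Route::NoMaster; // analogous to index >= cq->npipelines
    }
    return Route::Dispatch;
}
```

With a map whose masters are {0, -1, 3} and two pipelines, only
vbucket 0 is dispatchable; vbucket 1 has no master and vbucket 2
points past the pipeline array.
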
This is not a CBS 6.0-specific bug. The race exists on every server
version. CBS 6.0 just exposes it more reliably because no
PROTOCOL_BINARY_FEATURE_DUPLEX means no server-driven config push, no
idle inbound read watcher, and config arrives later via bgpoll
(default 2.5 s) -- giving retryq more chances to fire inside the
replace window. On newer servers the duplex push path drives a single
atomic config replace from the read callback, hitting the same race
far less often.
Empirical confirmation from SDKD situational tests vs CBS 6.0.5 + TLS:
zero LCB_ERR_TIMEOUT errors were observed in either the baseline
(build-1515) or the CCBC-1701 keepalive build (build-1522), confirming
the timeout-detection hypothesis was off-target. Errors are dominated by
LCB_ERR_NO_MATCHING_SERVER from this code path and downstream
LCB_ERR_DOCUMENT_NOT_FOUND from retried ops landing on a
partially-converged map. With the fixes here, build-1523 reports
NO_MATCHING_SERVER count 363 -> 0 across the entire suite.
Fixes applied
-------------
1) Flip LCB_RETRY_ON_MISSINGNODE default from 0 to LCB_RETRY_CMDS_ALL.
Packets whose vbucket has no mapped master are requeued back into
the retry queue rather than failed immediately; the op deadline
(default 2.5 s) still bounds how long we keep retrying. If the map
never recovers within the deadline, the op fails with the original
error (preserved as origerr): the user-visible code is unchanged, but
it no longer results from a scheduling coincidence. Opt-out is
unchanged via cntl / connection-string "retry=missingnode=0".
2) Hold a ConfigInfo ref across Server::handle_nmv. The pre-fix code
read instance->cmdq.config (a raw lcbvb_CONFIG*) without a
refcount, so cccp_update() inside the handler -- or any other
config-replace path that fires in a nested event-loop frame --
could decref the old ConfigInfo and free its lcbvb_CONFIG while
lcbvb_nmv_remap_ex() was mid-deref. Build-1523 hit this as a
SIGSEGV in lcbvb_nmv_remap_ex during FoRecoverDelta scenarios when
the requeue change in (1) kept ops in flight long enough to reach
handle_nmv across a config replacement. The latent UAF was
reachable on master too, just rarer; the requeue fix surfaced it.
Hold the ref at function entry, decref on every return path.
3) Defensive zeroing in lcbvb_destroy(). After freeing the contained
pointers, NULL them out before freeing the struct. Any remaining
latent UAF on lcbvb_CONFIG (through cmdq.config, a captured
lcbvb_CONFIG*, or a stale Server) now NULL-derefs deterministically
at the offending field instead of reading whatever the next
allocator hands out. Cheap insurance.
4) Atomic swap in replace_config(). Replaced the
take_pipelines/build/add_pipelines pattern -- which left the cmdq
with pipelines=NULL and npipelines=0 across an arbitrary number of
synchronous Server constructor allocs -- with a single tight swap
block at the end of replace_config(). cq->pipelines, cq->npipelines,
cq->_npipelines_ex, cq->scheds, and cq->config are now updated as
one sequence; the cmdq is never observably in a half-installed
state. This is structural defense-in-depth.
5) Bounds guard in lcb_vbguess_remap(). PS2 still reproduced exactly
one SIGSEGV in FoRecoverDelta-SUBDOC (build-1524), at the first
deref of cfg->vbuckets[vbid], with cfg = LCBT_VBCONFIG(instance)
pointing at a freed-and-poisoned lcbvb_CONFIG (vbuckets=NULL via
(3)). The handle_nmv ref guard in (2) pins cur_configinfo but the
deref is on cmdq.config; there is at least one SUBDOC code path
where the two diverge or where cmdq.config briefly points at a
stale vbc. Until that path is fully traced, gate the deref:
reject NULL cfg, NULL cfg->vbuckets, or vbid out of [0, cfg->nvb).
On reject, the op falls through lcb_kv_should_retry ->
mcreq_renew_packet -> retryq->nmvadd, which is the same path as a
legitimate "no remap currently available" and waits for the next
config to arrive. Logged at WARN so the rejection is visible in
production logs.
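
The three rejection conditions in (5) can be sketched as a standalone
predicate. This is an illustration under assumptions, not the patched
lcb_vbguess_remap(): VbucketSketch, VbConfigSketch and remap_guard_ok
are hypothetical names, and the struct layout is simplified.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, simplified stand-ins for the real vbucket config.
struct VbucketSketch {
    int servers[4];
};
struct VbConfigSketch {
    VbucketSketch *vbuckets; // NULL once the config has been poisoned
    std::size_t nvb;         // number of vbuckets
};

// Returns true only when cfg->vbuckets[vbid] is safe to dereference:
// cfg is non-NULL, cfg->vbuckets is non-NULL, and vbid is in [0, nvb).
bool remap_guard_ok(const VbConfigSketch *cfg, long vbid)
{
    if (cfg == nullptr) {
        return false;
    }
    if (cfg->vbuckets == nullptr) {
        return false; // e.g. a destroyed config whose pointers were zeroed
    }
    if (vbid < 0 || static_cast<std::size_t>(vbid) >= cfg->nvb) {
        return false;
    }
    return true;
}
```

A caller would treat a false result the same way as "no remap
currently available" and leave the op to be retried, rather than
touching cfg->vbuckets.
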
Tests
-----
tests/iotests/t_netfail.cc adds two cases. Both reproduce the failure
mode without iptables.
testRetryOnMissingNodeAfterMapRepair: stops the config monitor, sets
vbuckets[vb].servers[0] = -1 to force lcbvb_vbmaster() to return -1
(the precise condition retryq.cc:264 trips on), schedules a 200 ms
timer that restores the master in place, and issues an lcb_get with
a 2 s deadline. On gerrit/master (MISSINGNODE = 0) retryq fails the
op at the first 10 ms tick with LCB_ERR_NO_MATCHING_SERVER. With (1)
applied, the op stays in retryq across ticks; once the timer
restores the map, the next tick dispatches it on the correct
pipeline and the GET succeeds.
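
The timing in this test can be modelled without a cluster. The sketch
below is a toy simulation, not the retryq implementation: SimParams,
Outcome and run_op are hypothetical names, and the 10 ms tick and 2 s
deadline are taken from the description above.

```cpp
#include <cassert>
#include <functional>

enum class Outcome { Success, NoMatchingServer };

// Hypothetical knobs for the simulation.
struct SimParams {
    int tick_ms = 10;                  // retry queue tick interval
    int deadline_ms = 2000;            // op deadline used by the test
    bool retry_on_missing_node = true; // false models the old default
};

// has_master(now) reports whether the map names a master for the op's
// vbucket at simulated time `now` (milliseconds since scheduling).
Outcome run_op(const SimParams &p, const std::function<bool(int)> &has_master)
{
    assert(p.tick_ms > 0); // a zero tick would never advance time
    for (int now = p.tick_ms; now <= p.deadline_ms; now += p.tick_ms) {
        if (has_master(now)) {
            return Outcome::Success; // dispatched on the restored map
        }
        if (!p.retry_on_missing_node) {
            return Outcome::NoMatchingServer; // old default: fail at first tick
        }
        // New default: stay queued and look again on the next tick.
    }
    return Outcome::NoMatchingServer; // deadline hit; original error kept
}
```

With a map that is repaired at 200 ms, the old default fails on the
first tick while the new default succeeds; if the map never recovers,
both end with the same error, the new default only after the deadline.
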
testConfigReplaceMidRetry: same setup, but instead of restoring the
live vbc in place, snapshots the vbc to JSON via lcbvb_save_json(),
loads it into a brand-new lcbvb_CONFIG, wraps it in a ConfigInfo, and
calls lcb_update_vbconfig() from the timer. This exercises the full
replace_config() path: cur_configinfo is swapped, replace_config()
moves pipelines, the old ConfigInfo is decref'd, and the old
lcbvb_CONFIG is freed via lcbvb_destroy(). Without (4) the cmdq
would have been visibly inconsistent during the swap; without (2)
and (3) any captured raw lcbvb_CONFIG* across the call would UAF on
the freed memory. With all five fixes, the post-swap retryq tick
dispatches on the new pipelines and the GET succeeds.
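
Why holding a reference matters across a replace can be shown with
std::shared_ptr standing in for the ConfigInfo refcount. This is an
analogy under assumptions, not the libcouchbase code: ConfigSketch,
InstanceSketch and master_across_replace are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a refcounted config object.
struct ConfigSketch {
    std::vector<int> masters;
};

struct InstanceSketch {
    std::shared_ptr<ConfigSketch> current;

    // Swapping in a new config drops the instance's ref on the old one.
    void replace_config(std::shared_ptr<ConfigSketch> next)
    {
        current = std::move(next);
    }
};

// Models a handler during which a config replacement fires. The old
// config is pinned at entry, so reading it after the replace is safe;
// a raw pointer taken from inst.current would dangle at that point.
int master_across_replace(InstanceSketch &inst, std::size_t vbid,
                          std::shared_ptr<ConfigSketch> incoming)
{
    std::shared_ptr<ConfigSketch> pinned = inst.current; // ref at entry
    assert(pinned != nullptr);
    inst.replace_config(std::move(incoming)); // nested replace
    return pinned->masters.at(vbid);          // old config still alive
}   // pinned goes out of scope here: the "decref on every return path"
```

Once the handler returns, the last reference to the old config is
gone and it is freed, while the instance already serves the new one.
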
The pre-existing testNegativeIndex still passes; the op now waits
the full 500 ms op_timeout before failing with NO_MATCHING_SERVER
(preserved as origerr) instead of failing fast.
Verified locally on the libev and libevent IO plugins: the full
unit-test suite is green (194/194, apart from one environment-only
skip of Behavior.PluginDefaults).
Files touched: src/settings.{cc,h}, src/mcserver/mcserver.cc,
src/newconfig.cc, src/vbucket/vbucket.c, tests/iotests/t_netfail.cc.
Change-Id: I83c8c95a1b279082cd750d8363664f14a78aa0c5
Reviewed-on: https://review.couchbase.org/c/libcouchbase/+/244745
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: Sergey Avseyev <sergey.avseyev@gmail.com>