Description
Bug report criteria
- This bug report is not security related, security issues should be disclosed privately via etcd maintainers.
- This is not a support request or question, support requests or questions should be raised in the etcd discussion forums.
- You have read the etcd bug reporting guidelines.
- Existing open issues along with etcd frequently asked questions have been checked and this is not a duplicate.
What happened?
Debugging k8s scale test failures, I found that applying mutation requests (write transactions) could be delayed by up to 1-2 seconds, which breaches the upstream SLO used in the clusterloader2 SLO measurement:
Latency of processing mutating API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes
In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day <= 1s
{
"level": "info",
"ts": "2024-05-30T07:25:09.038516Z",
"caller": "traceutil/trace.go:171",
"msg": "trace[734230629] transaction",
"detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" mod_revision:66816841 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" value_size:577 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" > >; read_only:false; response_revision:66831091; number_of_response:1; applied-index:66997644; }",
"duration": "1.146904799s",
"start": "2024-05-30T07:25:07.891574Z",
"end": "2024-05-30T07:25:09.038479Z",
"steps": [
"trace[734230629] 'register request to wait' (duration: 14.096µs)",
"trace[734230629] 'leader propose request' (duration: 4.609235ms)",
"trace[734230629] 'follower receive request' (duration: 3.222613ms)",
"trace[734230629] 'apply start' (duration: 8.69631ms)",
"trace[734230629] 'compare' (duration: 22.423µs)",
"trace[734230629] 'check requests' (duration: 2.474µs)",
"trace[734230629] 'get key's previous created_revision and leaseID' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 12.374µs)",
"trace[734230629] 'marshal mvccpb.KeyValue' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 1.589µs)",
"trace[734230629] 'store kv pair into bolt db' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 8.567µs)",
"trace[734230629] 'attach lease to kv pair' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 243ns)",
"trace[734230629] 'applied transaction' (duration: 1.466µs)",
"trace[734230629] 'end transaction' (duration: 1.129944403s)"
],
"step_count": 12
}
The slow 'end transaction' step was caused by the watchableStore mutex being held by the syncWatchers process, as the trace and warning below show (a simplified sketch of this contention follows the log output).
{
"level": "info",
"ts": "2024-05-31T08:06:25.099259Z",
"caller": "traceutil/trace.go:171",
"msg": "trace[1699572871] transaction",
"detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" mod_revision:117957991 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" value_size:565 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" > >; read_only:false; response_revision:117967637; number_of_response:1; applied-index:118404076; }",
"duration": "1.291462327s",
"start": "2024-05-31T08:06:23.807769Z",
"end": "2024-05-31T08:06:25.099231Z",
"steps": [
"trace[1699572871] 'apply start' (duration: 389.561483ms)",
"trace[1699572871] 'after lock watchableStore mutex' (duration: 897.589474ms)"
],
"step_count": 2
}
{
"level": "warn",
"ts": "2024-05-31T08:06:23.086653Z",
"caller": "mvcc/watchable_store.go:230",
"msg": "slow sync watchers process",
"took": "905.879353ms",
"expected-duration-threshold": "100ms"
}
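To make the contention concrete, here is a rough, self-contained sketch (not the actual etcd code) of the pattern behind these traces: the apply path's 'end transaction' step and the background syncWatchers run both take the same watchableStore mutex, so a slow sync run directly stalls transaction applies.

package main

import (
	"fmt"
	"sync"
	"time"
)

// contended mimics the watchableStore: one mutex shared by the apply path
// (txn end -> notify watchers) and the background syncWatchers loop that
// catches up unsynced watchers.
type contended struct {
	mu sync.Mutex
}

// syncWatchers stands in for the periodic catch-up of unsynced watchers;
// while it holds the mutex, no transaction can end.
func (c *contended) syncWatchers(work time.Duration) {
	c.mu.Lock()
	defer c.mu.Unlock()
	time.Sleep(work) // reading events from the backend, sending to watchers
}

// endTxn stands in for ending a write transaction: it must take the same
// mutex to notify watchers about the new revision.
func (c *contended) endTxn() time.Duration {
	start := time.Now()
	c.mu.Lock()
	defer c.mu.Unlock()
	return time.Since(start) // time spent waiting == the slow "end transaction" step
}

func main() {
	c := &contended{}
	go c.syncWatchers(900 * time.Millisecond) // a slow sync run
	time.Sleep(10 * time.Millisecond)         // let syncWatchers grab the lock first
	fmt.Printf("txn blocked for %v waiting on watchableStore mutex\n", c.endTxn())
}

With a 900ms sync run, the transaction blocks for roughly those 900ms, which matches the 'after lock watchableStore mutex' step in the trace above.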
What did you expect to happen?
I would like syncWatchers to complete within 100ms and not hold the watchableStore mutex for so long.
How can we reproduce it (as minimally and precisely as possible)?
I can work on a new benchmark command to simulate this. As long as enough writes (QPS and throughput) are sent to etcd and a watch is established on the same key prefix, the issue can be reproduced.
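A minimal sketch of such a benchmark using clientv3, assuming a local endpoint at 127.0.0.1:2379; the prefix, writer count, and value size are made-up parameters to tune until syncWatchers starts falling behind on your hardware:

package main

import (
	"context"
	"fmt"
	"strings"
	"sync"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Assumed parameters, not from the original report.
	const (
		endpoint  = "127.0.0.1:2379"
		prefix    = "/bench/leases/"
		writers   = 64
		valueSize = 600 // roughly the lease object size seen in the traces
	)

	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{endpoint},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	// Establish a watch on the prefix so every write has to be fanned out
	// to a watcher; under enough write load, watchers fall behind and
	// syncWatchers has to catch them up.
	go func() {
		for resp := range cli.Watch(ctx, prefix, clientv3.WithPrefix()) {
			_ = resp // consume events, do as little work as possible
		}
	}()

	// Hammer the prefix with concurrent puts and report the slow ones.
	value := strings.Repeat("x", valueSize)
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			key := fmt.Sprintf("%skey-%d", prefix, id)
			for ctx.Err() == nil {
				start := time.Now()
				if _, err := cli.Put(ctx, key, value); err != nil {
					return
				}
				if d := time.Since(start); d > time.Second {
					fmt.Printf("slow put on %s: %v\n", key, d)
				}
			}
		}(i)
	}
	wg.Wait()
}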
Anything else we need to know?
Exploring options
- Increase the hardcoded chanBufLen = 128, which would let more events queue in the send buffer so watchers don't land in the unsynced set in the first place (a rough sketch of this mechanism follows the list).
- Optimize syncWatchers latency for each run, for example by limiting the number of events read from the backend.
- Lower the frequency of running syncWatchers so the end-to-end latency has less jitter and the p99 value improves.
- Optimize the throughput of func (sws *serverWatchStream) sendLoop() strategically.
Options 1 and 2 helped cut the mutating request latency down to 0.2s in repeated runs of the same k8s scale test. Option 3 did not.
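As a rough illustration of the mechanism behind option 1 (not the actual etcd code): each watcher receives events through a buffered channel of size chanBufLen, and once that buffer fills up the watcher has to be caught up later by a background sync that runs under the store-wide mutex. A larger buffer absorbs bursts so fewer watchers end up there:

package main

import "fmt"

// send mimics the notify path: a non-blocking send into the watcher's
// buffered channel; if the buffer is full, the watcher is marked as needing
// catch-up, which a background sync must later do under the store-wide mutex.
func send(ch chan int, ev int, needsCatchUp map[chan int]bool) {
	select {
	case ch <- ev:
	default:
		needsCatchUp[ch] = true
	}
}

func main() {
	const events = 256
	behindSmall := map[chan int]bool{}
	behindLarge := map[chan int]bool{}
	small := make(chan int, 128)  // current hardcoded chanBufLen
	large := make(chan int, 1024) // hypothetical larger buffer

	for ev := 0; ev < events; ev++ {
		send(small, ev, behindSmall)
		send(large, ev, behindLarge)
	}
	fmt.Printf("buffer 128: fell behind=%v, buffer 1024: fell behind=%v\n",
		len(behindSmall) > 0, len(behindLarge) > 0)
}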
Etcd version (please run commands below)
All supported etcd versions
Etcd configuration (command line flags or environment variables)
paste your configuration here
Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)
$ etcdctl member list -w table
# paste output here
$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here
Relevant log output
No response