
Slow watchers impact PUT latency #18109

Open
@chaochn47

Description



What happened?

While debugging k8s scale test failures, I found that applying mutating requests (write transactions) could be delayed by up to 1-2 seconds, which breaches the upstream SLO as measured by the clusterloader2 test.

cc @hakuna-matatah

Latency of processing mutating API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes

In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day <= 1s

{
    "level": "info",
    "ts": "2024-05-30T07:25:09.038516Z",
    "caller": "traceutil/trace.go:171",
    "msg": "trace[734230629] transaction",
    "detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" mod_revision:66816841 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" value_size:577 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" > >; read_only:false; response_revision:66831091; number_of_response:1; applied-index:66997644; }",
    "duration": "1.146904799s",
    "start": "2024-05-30T07:25:07.891574Z",
    "end": "2024-05-30T07:25:09.038479Z",
    "steps": [
        "trace[734230629] 'register request to wait'  (duration: 14.096µs)",
        "trace[734230629] 'leader propose request'  (duration: 4.609235ms)",
        "trace[734230629] 'follower receive request'  (duration: 3.222613ms)",
        "trace[734230629] 'apply start'  (duration: 8.69631ms)",
        "trace[734230629] 'compare'  (duration: 22.423µs)",
        "trace[734230629] 'check requests'  (duration: 2.474µs)",
        "trace[734230629] 'get key's previous created_revision and leaseID' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 12.374µs)",
        "trace[734230629] 'marshal mvccpb.KeyValue' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 1.589µs)",
        "trace[734230629] 'store kv pair into bolt db' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 8.567µs)",
        "trace[734230629] 'attach lease to kv pair' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 243ns)",
        "trace[734230629] 'applied transaction'  (duration: 1.466µs)",
        "trace[734230629] 'end transaction'  (duration: 1.129944403s)"
    ],
    "step_count": 12
}

The slow end transaction step was caused by the watchableStore mutex being held by the syncWatchers process, as shown in the trace and warning log below.

{
    "level": "info",
    "ts": "2024-05-31T08:06:25.099259Z",
    "caller": "traceutil/trace.go:171",
    "msg": "trace[1699572871] transaction",
    "detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" mod_revision:117957991 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" value_size:565 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" > >; read_only:false; response_revision:117967637; number_of_response:1; applied-index:118404076; }",
    "duration": "1.291462327s",
    "start": "2024-05-31T08:06:23.807769Z",
    "end": "2024-05-31T08:06:25.099231Z",
    "steps": [
        "trace[1699572871] 'apply start'  (duration: 389.561483ms)",
        "trace[1699572871] 'after lock watchableStore mutex'  (duration: 897.589474ms)"
    ],
    "step_count": 2
}
{
    "level": "warn",
    "ts": "2024-05-31T08:06:23.086653Z",
    "caller": "mvcc/watchable_store.go:230",
    "msg": "slow sync watchers process",
    "took": "905.879353ms",
    "expected-duration-threshold": "100ms"
}
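To make the contention concrete, here is a minimal self-contained Go sketch of the pattern described above (hypothetical types and exaggerated durations, not etcd's actual mvcc code): a background sync loop holds the same mutex the write path needs, so a long sync run directly inflates PUT latency.

package main

import (
	"fmt"
	"sync"
	"time"
)

type watchableStore struct {
	mu sync.Mutex
}

// put models the write transaction's "end transaction" step, which needs the
// store mutex to notify watchers.
func (s *watchableStore) put(key string) time.Duration {
	start := time.Now()
	s.mu.Lock() // blocks while syncWatchers holds the lock
	defer s.mu.Unlock()
	_ = key // notify synced watchers, etc.
	return time.Since(start)
}

// syncWatchers models the periodic catch-up of unsynced watchers; it holds
// the lock for the whole run, exaggerated here to ~900ms.
func (s *watchableStore) syncWatchers() {
	s.mu.Lock()
	defer s.mu.Unlock()
	time.Sleep(900 * time.Millisecond) // stand-in for reading and fanning out the backlog
}

func main() {
	s := &watchableStore{}
	go func() {
		for {
			s.syncWatchers()
			time.Sleep(100 * time.Millisecond) // sync interval
		}
	}()

	time.Sleep(50 * time.Millisecond) // let a sync run start first
	fmt.Printf("put blocked for %v\n", s.put("/registry/leases/..."))
}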

What did you expect to happen?

I would like syncWatchers to complete within 100ms and not hold the lock for too long.

How can we reproduce it (as minimally and precisely as possible)?

I can work on a new benchmark cmd to simulate this. As long as enough writes (in terms of QPS and throughput) are sent to etcd while a watch is established on the same key prefix, the issue can be reproduced.
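A rough sketch of such a reproduction (not the proposed benchmark cmd itself; the endpoint, key prefix, value size, and latency threshold below are arbitrary placeholders):

package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Establish a watch on the key prefix so every write has to be fanned out.
	wch := cli.Watch(ctx, "/registry/leases/", clientv3.WithPrefix())
	go func() {
		for resp := range wch {
			_ = resp
			// Consume slowly so the server-side buffer fills up and watchers
			// fall into the unsynced set.
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Drive sustained writes under the watched prefix and flag slow PUTs.
	value := strings.Repeat("x", 512)
	for i := 0; ; i++ {
		key := fmt.Sprintf("/registry/leases/kube-node-lease/node-%d", i%5000)
		start := time.Now()
		if _, err := cli.Put(ctx, key, value); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > time.Second {
			fmt.Printf("slow put: %v\n", d)
		}
	}
}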

Anything else we need to know?

Exploring options

  1. Increase the hardcoded chanBufLen = 128 so that more events can be buffered in the sending channel and watchers are less likely to become unsynced in the first place.
  2. Optimize syncWatchers latency for each run, for example by limiting the number of events read from the backend (see the sketch below).
  3. Lower the frequency of running syncWatchers so that end-to-end latency has less jitter and the p99 value improves.
  4. Strategically optimize the throughput of func (sws *serverWatchStream) sendLoop().

Options 1 and 2 help cut the mutating request latency down to 0.2s across repeated runs of the same k8s scale test. Option 3 does not.
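As an illustration of option 2 only (a simplified sketch with hypothetical names, not a patch against the real mvcc.watchableStore), bounding the work done per sync run also bounds how long the lock is held:

package main

import (
	"fmt"
	"sync"
)

const maxEventsPerSync = 10000 // hypothetical per-run cap

type event struct{ rev int64 }

type store struct {
	mu       sync.Mutex
	unsynced []event // backlog of events unsynced watchers still need
}

// syncWatchersOnce sends at most maxEventsPerSync events while holding the
// lock and returns the remaining backlog size, so the caller can reschedule
// the rest instead of blocking writers for the whole backlog.
func (s *store) syncWatchersOnce(send func(event)) int {
	s.mu.Lock()
	defer s.mu.Unlock()

	n := len(s.unsynced)
	if n > maxEventsPerSync {
		n = maxEventsPerSync
	}
	for _, ev := range s.unsynced[:n] {
		send(ev)
	}
	s.unsynced = s.unsynced[n:]
	return len(s.unsynced)
}

func main() {
	s := &store{unsynced: make([]event, 25000)}
	for {
		remaining := s.syncWatchersOnce(func(event) {})
		fmt.Println("remaining after this run:", remaining)
		if remaining == 0 {
			break
		}
	}
}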

Etcd version (please run commands below)

All supported etcd versions

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response
