
Slow watchers impact PUT latency #18109

Open
@chaochn47

Description



What happened?

While debugging k8s scale test failures, I found that applying mutating requests (write transactions) could be delayed by up to 1-2 seconds, which breaches the upstream SLO as measured by the clusterloader2 test.

cc @hakuna-matatah

Latency of processing mutating API calls for single objects for every (resource, verb) pair, measured as 99th percentile over last 5 minutes

In default Kubernetes installation, for every (resource, verb) pair, excluding virtual and aggregated resources and Custom Resource Definitions, 99th percentile per cluster-day <= 1s

{
    "level": "info",
    "ts": "2024-05-30T07:25:09.038516Z",
    "caller": "traceutil/trace.go:171",
    "msg": "trace[734230629] transaction",
    "detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" mod_revision:66816841 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" value_size:577 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal\" > >; read_only:false; response_revision:66831091; number_of_response:1; applied-index:66997644; }",
    "duration": "1.146904799s",
    "start": "2024-05-30T07:25:07.891574Z",
    "end": "2024-05-30T07:25:09.038479Z",
    "steps": [
        "trace[734230629] 'register request to wait'  (duration: 14.096µs)",
        "trace[734230629] 'leader propose request'  (duration: 4.609235ms)",
        "trace[734230629] 'follower receive request'  (duration: 3.222613ms)",
        "trace[734230629] 'apply start'  (duration: 8.69631ms)",
        "trace[734230629] 'compare'  (duration: 22.423µs)",
        "trace[734230629] 'check requests'  (duration: 2.474µs)",
        "trace[734230629] 'get key's previous created_revision and leaseID' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 12.374µs)",
        "trace[734230629] 'marshal mvccpb.KeyValue' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 1.589µs)",
        "trace[734230629] 'store kv pair into bolt db' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 8.567µs)",
        "trace[734230629] 'attach lease to kv pair' {req_type:put; key:/registry/leases/kube-node-lease/ip-10-11-251-219.us-west-2.compute.internal; req_size:658; } (duration: 243ns)",
        "trace[734230629] 'applied transaction'  (duration: 1.466µs)",
        "trace[734230629] 'end transaction'  (duration: 1.129944403s)"
    ],
    "step_count": 12
}

The slow end transaction step was caused by the watchableStore mutex being held by the syncWatchers process, as shown in the trace and warning log below.

{
    "level": "info",
    "ts": "2024-05-31T08:06:25.099259Z",
    "caller": "traceutil/trace.go:171",
    "msg": "trace[1699572871] transaction",
    "detail": "{req_content:compare:<target:MOD key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" mod_revision:117957991 > success:<request_put:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" value_size:565 >> failure:<request_range:<key:\"/registry/leases/kube-node-lease/ip-10-8-7-80.us-west-2.compute.internal\" > >; read_only:false; response_revision:117967637; number_of_response:1; applied-index:118404076; }",
    "duration": "1.291462327s",
    "start": "2024-05-31T08:06:23.807769Z",
    "end": "2024-05-31T08:06:25.099231Z",
    "steps": [
        "trace[1699572871] 'apply start'  (duration: 389.561483ms)",
        "trace[1699572871] 'after lock watchableStore mutex'  (duration: 897.589474ms)"
    ],
    "step_count": 2
}
{
    "level": "warn",
    "ts": "2024-05-31T08:06:23.086653Z",
    "caller": "mvcc/watchable_store.go:230",
    "msg": "slow sync watchers process",
    "took": "905.879353ms",
    "expected-duration-threshold": "100ms"
}
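To make the contention concrete, here is a minimal self-contained Go sketch of the pattern described above (hypothetical types and exaggerated durations, not etcd's actual mvcc code): a background sync loop holds the same mutex the write path needs, so a long sync run directly inflates PUT latency.

package main

import (
	"fmt"
	"sync"
	"time"
)

type watchableStore struct {
	mu sync.Mutex
}

// put models the write transaction's "end transaction" step, which needs the
// store mutex to notify watchers.
func (s *watchableStore) put(key string) time.Duration {
	start := time.Now()
	s.mu.Lock() // blocks while syncWatchers holds the lock
	defer s.mu.Unlock()
	_ = key // notify synced watchers, etc.
	return time.Since(start)
}

// syncWatchers models the periodic catch-up of unsynced watchers; it holds
// the lock for the whole run, exaggerated here to ~900ms.
func (s *watchableStore) syncWatchers() {
	s.mu.Lock()
	defer s.mu.Unlock()
	time.Sleep(900 * time.Millisecond) // stand-in for reading and fanning out the backlog
}

func main() {
	s := &watchableStore{}
	go func() {
		for {
			s.syncWatchers()
			time.Sleep(100 * time.Millisecond) // sync interval
		}
	}()

	time.Sleep(50 * time.Millisecond) // let a sync run start first
	fmt.Printf("put blocked for %v\n", s.put("/registry/leases/..."))
}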

What did you expect to happen?

I would like syncWatchers to complete within 100ms and not hold the lock for too long.

How can we reproduce it (as minimally and precisely as possible)?

I can work on a new benchmark cmd to simulate this. As long as enough writes (in terms of QPS and throughput) are sent to etcd while a watch is established on the same key prefix, the issue can be reproduced.
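A rough sketch of such a reproduction (not the proposed benchmark cmd itself; the endpoint, key prefix, value size, and latency threshold below are arbitrary placeholders):

package main

import (
	"context"
	"fmt"
	"strings"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Establish a watch on the key prefix so every write has to be fanned out.
	wch := cli.Watch(ctx, "/registry/leases/", clientv3.WithPrefix())
	go func() {
		for resp := range wch {
			_ = resp
			// Consume slowly so the server-side buffer fills up and watchers
			// fall into the unsynced set.
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Drive sustained writes under the watched prefix and flag slow PUTs.
	value := strings.Repeat("x", 512)
	for i := 0; ; i++ {
		key := fmt.Sprintf("/registry/leases/kube-node-lease/node-%d", i%5000)
		start := time.Now()
		if _, err := cli.Put(ctx, key, value); err != nil {
			panic(err)
		}
		if d := time.Since(start); d > time.Second {
			fmt.Printf("slow put: %v\n", d)
		}
	}
}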

Anything else we need to know?

Exploring options

  1. Increase the hardcoded chanBufLen = 128 so that more events can be buffered in the sending channel and watchers are less likely to become unsynced in the first place.
  2. Optimize syncWatchers latency for each run, for example by limiting the number of events read from the backend (see the sketch below).
  3. Lower the frequency of running syncWatchers so that end-to-end latency has less jitter and the p99 value improves.
  4. Strategically optimize the throughput of func (sws *serverWatchStream) sendLoop().

Options 1 and 2 help cut the mutating request latency down to 0.2s across repeated runs of the same k8s scale test. Option 3 does not.
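As an illustration of option 2 only (a simplified sketch with hypothetical names, not a patch against the real mvcc.watchableStore), bounding the work done per sync run also bounds how long the lock is held:

package main

import (
	"fmt"
	"sync"
)

const maxEventsPerSync = 10000 // hypothetical per-run cap

type event struct{ rev int64 }

type store struct {
	mu       sync.Mutex
	unsynced []event // backlog of events unsynced watchers still need
}

// syncWatchersOnce sends at most maxEventsPerSync events while holding the
// lock and returns the remaining backlog size, so the caller can reschedule
// the rest instead of blocking writers for the whole backlog.
func (s *store) syncWatchersOnce(send func(event)) int {
	s.mu.Lock()
	defer s.mu.Unlock()

	n := len(s.unsynced)
	if n > maxEventsPerSync {
		n = maxEventsPerSync
	}
	for _, ev := range s.unsynced[:n] {
		send(ev)
	}
	s.unsynced = s.unsynced[n:]
	return len(s.unsynced)
}

func main() {
	s := &store{unsynced: make([]event, 25000)}
	for {
		remaining := s.syncWatchersOnce(func(event) {})
		fmt.Println("remaining after this run:", remaining)
		if remaining == 0 {
			break
		}
	}
}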

Etcd version (please run commands below)

All supported etcd versions

Etcd configuration (command line flags or environment variables)

paste your configuration here

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response
