Missing delete event on watch opened on same revision as compaction request #19179

Open
@serathius

Description

Bug report criteria

What happened?

Since 9 January we have been getting failures in presubmit tests.

Presubmit history goes back to December 31, with failures starting only on 9 January, implying the issue is new.

Failures are due to the resumable guarantee being broken:

 logger.go:146: 2025-01-10T22:33:35.465Z	ERROR	Broke watch guarantee	{"guarantee": "resumable", "client": 4, "request": {"Key":"/registry/pods/","Revision":409,"WithPrefix":true,"WithProgressNotify":true,"WithPrevKV":true}, "got-event": {"Type":"delete-operation","Key":"/registry/pods/default/jCocA","Value":{"Value":"","Hash":0},"Revision":410,"IsCreate":false,"PrevValue":{"Value":{"Value":"143","Hash":0},"ModRevision":146}}, "want-event": {"Type":"delete-operation","Key":"/registry/pods/default/OL767","Value":{"Value":"","Hash":0},"Revision":409,"IsCreate":false}}
    validate.go:48: Failed validating watch history, err: broke Resumable - A broken watch can be resumed by establishing a new watch starting after the last revision received in a watch event before the break, so long as the revision is in the history window
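
To make the violation in the log concrete, here is a minimal sketch of the kind of check that fails. It is not the actual code in validate.go. The `Event` and `WatchRequest` types and the `firstExpected` and `checkResumable` helpers are hypothetical, and the history is reduced to the two delete events named in the log. The assumption is that a watch opened at revision R must deliver, as its first event, the earliest matching event in history with revision >= R.

```go
package main

import (
	"fmt"
	"strings"
)

// Event is a simplified, hypothetical stand-in for a watch event.
type Event struct {
	Type     string
	Key      string
	Revision int64
}

// WatchRequest mirrors a subset of the "request" fields in the log above.
type WatchRequest struct {
	Key        string
	Revision   int64
	WithPrefix bool
}

// matches reports whether the request covers the given key.
func (r WatchRequest) matches(key string) bool {
	if r.WithPrefix {
		return strings.HasPrefix(key, r.Key)
	}
	return key == r.Key
}

// firstExpected returns the earliest event in history (assumed sorted by
// revision) that a watch opened with req should deliver.
func firstExpected(history []Event, req WatchRequest) (Event, bool) {
	for _, e := range history {
		if e.Revision >= req.Revision && req.matches(e.Key) {
			return e, true
		}
	}
	return Event{}, false
}

// checkResumable returns an error when the first event received by the
// watch is not the first event the history says it should have received.
func checkResumable(history []Event, req WatchRequest, got []Event) error {
	if len(got) == 0 {
		return nil // nothing received yet, so nothing to compare
	}
	want, ok := firstExpected(history, req)
	if !ok {
		return fmt.Errorf("got event %+v but history has no matching event", got[0])
	}
	if got[0] != want {
		return fmt.Errorf("broke resumable: got-event %+v, want-event %+v", got[0], want)
	}
	return nil
}

func main() {
	// Keys and revisions are taken from the log line above.
	history := []Event{
		{Type: "delete-operation", Key: "/registry/pods/default/OL767", Revision: 409},
		{Type: "delete-operation", Key: "/registry/pods/default/jCocA", Revision: 410},
	}
	req := WatchRequest{Key: "/registry/pods/", Revision: 409, WithPrefix: true}

	// The failing client saw the stream start at revision 410, skipping 409.
	got := history[1:]
	fmt.Println(checkResumable(history, req, got))
}
```

In this sketch the watch requested revision 409 but its first event is the delete at revision 410, so `checkResumable` reports a violation. Passing the full history as `got` returns nil.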

From the history visualizations I have seen, it follows this pattern:

  • Delete operation on rev X
  • Compact on Rev X
  • Etcd crashes on Rev X
  • Watch opened on Rev X


What did you expect to happen?

Resumable guarantee should not be broken.

How can we reproduce it (as minimally and precisely as possible)?

I haven't yet managed to reproduce it locally.

Anything else we need to know?

https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-amd64/1877585036438409216
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877364764741472256
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877466683589791744
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877575502907052032
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877585037260492800
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877678423757819904
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1877842586459181056
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1878101264374435840
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1878113522366287872
https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/directory/pull-etcd-robustness-arm64/1878196741560340480

Etcd version (please run commands below)

I was not able to reproduce the issue outside of CI, so I haven't confirmed other versions.

Etcd configuration (command line flags or environment variables)

N/A

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

N/A

Relevant log output

No response
