Description
What steps did you take and what happened:
We have encountered situations where, after several weeks of launching pods with LVM volumes, replacing nodes, and generally using the cluster, creation of new volumes slows to a crawl (i.e., minutes between creations despite many pods pending and plenty of available space on nodes). At the same time, there are hundreds of LVMVolume objects with deletionTimestamps and ownerNodeIDs of non-existent nodes.
Most recently, we found that we had 620 LVMVolume objects, with 340 of them in this state. Deleting these leaked LVMVolume objects speeds up creation of new ones.
We're not 100% certain, but we believe the leak is caused by nodes getting deleted before the LVMVolumes associated with them have been cleaned up.
In our case, we use Karpenter to manage nodes, and it deletes the underlying EC2 instance once the node has been empty of non-DaemonSet pods for some period of time.
This may be the same as #347, but there aren't enough details in that one to be sure.
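For anyone hitting the same thing, a quick way to confirm it is to count LVMVolumes that are terminating but whose ownerNodeID no longer matches any node. A minimal sketch, using the same field paths as the cleanup script at the bottom of this issue (the openebs namespace and the current kubeconfig context are assumptions; adjust as needed):
#!/usr/bin/env python3
# Diagnostic sketch: count LVMVolume objects that are stuck deleting on nodes
# that no longer exist. Assumes the "openebs" namespace and the current
# kubeconfig context.
import json
import subprocess


def items(*args: str) -> list[dict]:
    out = subprocess.run(
        ["kubectl", *args, "-o", "json"],
        check=True, text=True, capture_output=True,
    ).stdout
    return json.loads(out)["items"]


volumes = items("get", "lvmvolumes", "--namespace", "openebs")
node_names = {node["metadata"]["name"] for node in items("get", "nodes")}

leaked = [
    vol["metadata"]["name"]
    for vol in volumes
    if "deletionTimestamp" in vol["metadata"]
    and vol["spec"]["ownerNodeID"] not in node_names
]
print(f"{len(leaked)} leaked LVMVolumes out of {len(volumes)} total")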
What did you expect to happen:
- LVMVolumes on nodes that no longer exist should be deleted rather than left blocked by a finalizer that never gets removed; if the node doesn't exist, the volume on that node also doesn't exist (see the sketch after this list).
- Having many LVMVolume objects should not impact creation of new ones.
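For context, the leaked objects all look roughly like this, trimmed down to the fields that matter (the name, timestamp, node name, and finalizer string are illustrative placeholders rather than values from our cluster):
# Trimmed, illustrative shape of a leaked LVMVolume as returned by
# `kubectl get lvmvolumes -n openebs -o json`; only the fields the cleanup
# script below keys on are shown, and the values are placeholders.
leaked_volume = {
    "metadata": {
        "name": "pvc-00000000-0000-0000-0000-000000000000",
        "deletionTimestamp": "2025-06-16T16:21:36Z",  # deletion was requested long ago
        "finalizers": ["lvm.openebs.io/finalizer"],   # assumed finalizer name; never removed
    },
    "spec": {
        "ownerNodeID": "ip-10-200-0-1.ec2.internal",  # node that no longer exists
    },
}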
The output of the following commands will help us better understand what's going on:
(Pasting long output into a GitHub gist or other Pastebin is fine.)
Controller logs:
csi-provisioner I0616 16:21:36.567743 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-2c65641e-440e-4e31-8192-1ad6acc240c4", UID:"f09e50ff-c09f-4ded-b65a-1e4c15d23319", APIVersion:"v1", ResourceVersion:"883943724", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = DeadlineExceeded desc = context deadline exceeded
csi-provisioner I0616 16:21:36.569874 1 connection.go:200] GRPC response: {}
csi-provisioner I0616 16:21:36.569915 1 connection.go:201] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
csi-provisioner E0616 16:21:36.569952 1 controller.go:1512] delete "pvc-b2317477-1904-4591-98d6-2c5474448c61": volume deletion failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
csi-provisioner W0616 16:21:36.570012 1 controller.go:989] Retrying syncing volume "pvc-b2317477-1904-4591-98d6-2c5474448c61", failure 1034
csi-provisioner E0616 16:21:36.570030 1 controller.go:1007] error syncing volume "pvc-b2317477-1904-4591-98d6-2c5474448c61": rpc error: code = DeadlineExceeded desc = context deadline exceeded
csi-provisioner I0616 16:21:36.570044 1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolume", Namespace:"", Name:"pvc-b2317477-1904-4591-98d6-2c5474448c61", UID:"c8ed79ca-e881-459c-b378-e030e60a4efa", APIVersion:"v1", ResourceVersion:"885072755", FieldPath:""}): type: 'Warning' reason: 'VolumeFailedDelete' rpc error: code = DeadlineExceeded desc = context deadline exceeded
openebs-lvm-plugin I0616 16:21:36.516580 1 controller.go:392] received request to delete volume "pvc-65c90c1a-9eaa-4b29-8a83-7b2a46e4f055"
openebs-lvm-plugin I0616 16:21:36.518581 1 controller.go:392] received request to delete volume "pvc-13b08b9b-6a47-4efb-9062-668e73f62f42"
openebs-lvm-plugin I0616 16:21:36.518835 1 controller.go:392] received request to delete volume "pvc-617ee615-f29b-46dc-8964-b60b50a9c791"
csi-provisioner I0616 16:21:36.735543 1 request.go:628] Waited for 119.260122ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.200.32.1:443/api/v1/namespaces/default/events/pvc-341e0ac9-cada-44f1-9ea8-76bd71876d75.1848b80983569725
openebs-lvm-plugin E0616 16:21:36.534025 1 grpc.go:79] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
csi-provisioner I0616 16:21:36.935593 1 request.go:628] Waited for 192.326659ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.200.32.1:443/api/v1/namespaces/default/events/pvc-2af87e3b-f5d4-4e66-a0bb-4de2e5d2f5b6.1848b8098318babb
csi-provisioner I0616 16:21:37.135645 1 request.go:628] Waited for 191.196251ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.200.32.1:443/api/v1/namespaces/default/events/pvc-f5cb28db-4b43-40b7-a785-93fa31c008de.1848b80e335b53a8
csi-provisioner I0616 16:21:37.335557 1 request.go:628] Waited for 191.260063ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.200.32.1:443/api/v1/namespaces/default/events/pvc-648bb2ba-c626-41c6-abd4-18696886bdbc.1848b80bd7af66f9
openebs-lvm-plugin E0616 16:21:36.547709 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
openebs-lvm-plugin E0616 16:21:36.547809 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
openebs-lvm-plugin E0616 16:21:36.550346 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
openebs-lvm-plugin E0616 16:21:36.550356 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
openebs-lvm-plugin E0616 16:21:36.550364 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
openebs-lvm-plugin E0616 16:21:36.555528 1 grpc.go:79] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
openebs-lvm-plugin E0616 16:21:36.555549 1 grpc.go:79] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
openebs-lvm-plugin E0616 16:21:36.556030 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
csi-provisioner I0616 16:21:37.535169 1 request.go:628] Waited for 191.323743ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.200.32.1:443/api/v1/namespaces/default/events/pvc-2c65641e-440e-4e31-8192-1ad6acc240c4.1848b809831f00f1
csi-provisioner I0616 16:21:37.735129 1 request.go:628] Waited for 191.660542ms due to client-side throttling, not priority and fairness, request: PATCH:https://10.200.32.1:443/api/v1/namespaces/default/events/pvc-b2317477-1904-4591-98d6-2c5474448c61.1848b80e2cce7cc7
openebs-lvm-plugin E0616 16:21:36.558807 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
openebs-lvm-plugin E0616 16:21:36.558762 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
openebs-lvm-plugin E0616 16:21:36.559058 1 grpc.go:79] GRPC error: rpc error: code = Canceled desc = context canceled
openebs-lvm-plugin E0616 16:21:36.564274 1 grpc.go:79] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
csi-provisioner I0616 16:21:40.258526 1 leaderelection.go:276] successfully renewed lease openebs/local-csi-openebs-io
openebs-lvm-plugin E0616 16:21:36.564412 1 grpc.go:79] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
openebs-lvm-plugin E0616 16:21:36.567553 1 grpc.go:79] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
openebs-lvm-plugin E0616 16:21:36.569798 1 grpc.go:79] GRPC error: rpc error: code = DeadlineExceeded desc = context deadline exceeded
Anything else you would like to add:
We wrote a script to clean up these leaked volumes manually:
#!/usr/bin/env python3
import json
import subprocess
import sys
from argparse import ArgumentParser


def kubectl(
    context: str,
    namespace: str | None,
    args: list[str],
):
    """Run kubectl against the given context/namespace and return the completed process."""
    command = ["kubectl", "--context", context]
    if namespace is not None:
        command.extend(["--namespace", namespace])
    command.extend(args)
    return subprocess.run(command, check=True, text=True, capture_output=True)


def parse_args(argv: list[str]):
    parser = ArgumentParser()
    parser.add_argument("--context", type=str, required=True)
    return parser.parse_args(argv)


def main(argv: list[str]):
    args = parse_args(argv)

    lvmvolumes = json.loads(
        kubectl(
            context=args.context,
            namespace="openebs",
            args=["get", "lvmvolumes", "-o", "json"],
        ).stdout,
    )["items"]
    deleting_lvmvolumes = [
        vol for vol in lvmvolumes if "deletionTimestamp" in vol["metadata"]
    ]

    # Intentionally getting node names after lvmvolumes to avoid race conditions
    # where the volume may be created and starting to delete after we query the list of nodes.
    # This relies on only cleaning things with deletionTimestamps to be safe.
    node_names = [
        node["metadata"]["name"]
        for node in json.loads(
            kubectl(
                context=args.context,
                namespace=None,
                args=["get", "nodes", "-o", "json"],
            ).stdout,
        )["items"]
    ]

    for vol in deleting_lvmvolumes:
        if vol["spec"]["ownerNodeID"] not in node_names:
            # The volume is already terminating and its node is gone, so clearing the
            # finalizers lets the API server finish deleting the object.
            print(f'Deleting lvmvolume: {vol["metadata"]["name"]}')
            kubectl(
                context=args.context,
                namespace="openebs",
                args=[
                    "patch",
                    "lvmvolumes",
                    vol["metadata"]["name"],
                    "-p",
                    '{"metadata":{"finalizers":null}}',
                    "--type",
                    "merge",
                ],
            )


if __name__ == "__main__":
    main(sys.argv[1:])
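To run it, pass the kubeconfig context of the affected cluster, e.g. python3 cleanup_leaked_lvmvolumes.py --context my-cluster (the file name here is just an example). It only touches LVMVolumes that already have a deletionTimestamp and whose ownerNodeID doesn't match any current node, so it should be safe to re-run.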
Environment:
- LVM Driver version: v1.6.2
- Kubernetes version (use kubectl version): v1.31.6
- Kubernetes installer & version: EKS
- Cloud provider or hardware configuration: AWS r6gd.8xlarge and r6gd.16xlarge
- OS (e.g. from /etc/os-release): Bottlerocket 1.39.0-0968c061