
feat: plumb EventRecorder and emit failure events #221

Open
fmuyassarov wants to merge 1 commit into kubernetes-sigs:main from Nordix:feat/events

Conversation

@fmuyassarov
Member

What type of PR is this?

/kind feature

What this PR does / why we need it:

Add an EventRecorder to the driver. As a starting point, we record and publish only the following failure events: ClaimPrepareFailed, NetworkDeviceAttachFailed, RDMADeviceAttachFailed, and NetworkDeviceDetachFailed.
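
For readers unfamiliar with the pattern, here is a minimal sketch of what "plumbing a recorder into the driver" can look like. It is not the code from this PR: `EventRecorder` below is a simplified, hypothetical stand-in for client-go's `record.EventRecorder` (the real interface takes a `runtime.Object` rather than a string reference), and `Driver`, `fakeRecorder`, and `prepareClaim` are illustrative names only.

```go
package main

import (
	"errors"
	"fmt"
)

// Event reasons named in this PR description.
const (
	ReasonClaimPrepareFailed        = "ClaimPrepareFailed"
	ReasonNetworkDeviceAttachFailed = "NetworkDeviceAttachFailed"
	ReasonRDMADeviceAttachFailed    = "RDMADeviceAttachFailed"
	ReasonNetworkDeviceDetachFailed = "NetworkDeviceDetachFailed"
)

// errNotFound mirrors the error text seen in the logs of this PR.
var errNotFound = errors.New("device dummy0 not found in store")

// EventRecorder is a simplified, hypothetical stand-in for client-go's
// record.EventRecorder. The object is identified by a plain string here.
type EventRecorder interface {
	Eventf(objectRef, eventType, reason, messageFmt string, args ...any)
}

// fakeRecorder keeps events in memory so the sketch runs without a cluster.
type fakeRecorder struct{ events []string }

func (f *fakeRecorder) Eventf(objectRef, eventType, reason, messageFmt string, args ...any) {
	msg := fmt.Sprintf(messageFmt, args...)
	f.events = append(f.events, fmt.Sprintf("%s %s %s: %s", eventType, reason, objectRef, msg))
}

// Driver is a hypothetical driver type with the recorder plumbed in as a field.
type Driver struct {
	recorder EventRecorder
}

// prepareClaim runs the given prepare step and, only on failure, emits a
// Warning event with reason ClaimPrepareFailed on the claim.
func (d *Driver) prepareClaim(claimRef string, prepare func() error) error {
	err := prepare()
	if err != nil {
		d.recorder.Eventf(claimRef, "Warning", ReasonClaimPrepareFailed, "%v", err)
	}
	return err
}

func main() {
	rec := &fakeRecorder{}
	d := &Driver{recorder: rec}
	_ = d.prepareClaim("resourceclaim/dummy-interface-static-ip", func() error {
		return errNotFound
	})
	for _, e := range rec.events {
		fmt.Println(e)
	}
}
```

Passing the recorder in as an interface keeps the failure paths testable with an in-memory fake; client-go ships `record.NewFakeRecorder` for the same purpose.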

One thing I'm not fully convinced about is the benefit of propagating events into the driver logs via:

eventBroadcaster.StartLogging(klog.Infof)

This replicates each event into the driver's logs (as seen with kubectl logs). I don't see a problem with propagating events to the logs, and it may help to discover failures a bit faster, but publishing the failure as an event may already be enough. I don't have a strong opinion on this and would be happy to hear what you think.
In practice it looks like this:

I0605 09:29:20.122080       1 db.go:592] Device "dummy0" not found in local store, rescanning.
I0605 09:29:20.271155       1 dra_hooks.go:432] claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get network interface name for device dummy0: device dummy0 not found in store
I0605 09:29:20.271283       1 nonblockinggrpcserver.go:161] "handling request succeeded" logger="dra" requestID=4 method="/k8s.io.kubelet.pkg.apis.dra.v1.DRAPlugin/NodePrepareResources" response="claims:{key:\"c8810e22-6955-4690-911b-3251b027bcd1\" value:{error:\"claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get network interface name for device dummy0: device dummy0 not found in store\"}}"
I0605 09:29:20.271407       1 event.go:377] Event(v1.ObjectReference{Kind:"ResourceClaim", Namespace:"default", Name:"dummy-interface-static-ip", UID:"c8810e22-6955-4690-911b-3251b027bcd1", APIVersion:"resource.k8s.io/v1", ResourceVersion:"897", FieldPath:""}): type: 'Warning' reason: 'ClaimPrepareFailed' failed to get network interface name for device dummy0: device dummy0 not found in store
I0605 09:30:38.120584       1 db.go:592] Device "dummy0" not found in local store, rescanning.
I0605 09:30:38.271555       1 dra_hooks.go:432] claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get network interface name for device dummy0: device dummy0 not found in store
I0605 09:30:38.271710       1 nonblockinggrpcserver.go:161] "handling request succeeded" logger="dra" requestID=5 method="/k8s.io.kubelet.pkg.apis.dra.v1.DRAPlugin/NodePrepareResources" response="claims:{key:\"c8810e22-6955-4690-911b-3251b027bcd1\" value:{error:\"claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get network interface name for device dummy0: device dummy0 not found in store\"}}"
I0605 09:30:38.271879       1 event.go:377] Event(v1.ObjectReference{Kind:"ResourceClaim", Namespace:"default", Name:"dummy-interface-static-ip", UID:"c8810e22-6955-4690-911b-3251b027bcd1", APIVersion:"resource.k8s.io/v1", ResourceVersion:"897", FieldPath:""}): type: 'Warning' reason: 'ClaimPrepareFailed' failed to get network interface name for device dummy0: device dummy0 not found in store
I0605 09:31:40.121683       1 db.go:592] Device "dummy0" not found in local store, rescanning.
I0605 09:31:40.268307       1 dra_hooks.go:432] claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get network interface name for device dummy0: device dummy0 not found in store
I0605 09:31:40.268448       1 nonblockinggrpcserver.go:161] "handling request succeeded" logger="dra" requestID=6 method="/k8s.io.kubelet.pkg.apis.dra.v1.DRAPlugin/NodePrepareResources" response="claims:{key:\"c8810e22-6955-4690-911b-3251b027bcd1\" value:{error:\"claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get network interface name for device dummy0: device dummy0 not found in store\"}}"
I0605 09:31:40.268559       1 event.go:377] Event(v1.ObjectReference{Kind:"ResourceClaim", Namespace:"default", Name:"dummy-interface-static-ip", UID:"c8810e22-6955-4690-911b-3251b027bcd1", APIVersion:"resource.k8s.io/v1", ResourceVersion:"897", FieldPath:""}): type: 'Warning' reason: 'ClaimPrepareFailed' failed to get network interface name for device dummy0: device dummy0 not found in store
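
If mirroring events into the logs turns out to be contentious, one option is to make it a switch. The sketch below illustrates the idea with a toy fan-out broadcaster; `broadcaster`, `newBroadcaster`, and the `mirrorToLogs` parameter are hypothetical names, not part of client-go or of this PR. In the real driver the switch would simply guard the `eventBroadcaster.StartLogging(klog.Infof)` call.

```go
package main

import "fmt"

// Event is a minimal event payload for this sketch.
type Event struct {
	Type, Reason, Object, Message string
}

// sink receives every event published through the broadcaster.
type sink func(Event)

// broadcaster is a toy stand-in for an event broadcaster: it fans each
// event out to all registered sinks.
type broadcaster struct{ sinks []sink }

func (b *broadcaster) addSink(s sink) { b.sinks = append(b.sinks, s) }

func (b *broadcaster) publish(e Event) {
	for _, s := range b.sinks {
		s(e)
	}
}

// newBroadcaster always records to apiSink. The log sink is attached only
// when mirrorToLogs is true (a hypothetical option).
func newBroadcaster(apiSink sink, logf func(format string, args ...any), mirrorToLogs bool) *broadcaster {
	b := &broadcaster{}
	b.addSink(apiSink)
	if mirrorToLogs {
		b.addSink(func(e Event) {
			logf("Event(%s): type: '%s' reason: '%s' %s", e.Object, e.Type, e.Reason, e.Message)
		})
	}
	return b
}

func main() {
	var stored []Event
	apiSink := func(e Event) { stored = append(stored, e) }
	logf := func(format string, args ...any) { fmt.Printf(format+"\n", args...) }

	b := newBroadcaster(apiSink, logf, true)
	b.publish(Event{
		Type:    "Warning",
		Reason:  "ClaimPrepareFailed",
		Object:  "ResourceClaim/default/dummy-interface-static-ip",
		Message: "device dummy0 not found in store",
	})
	fmt.Println("events stored:", len(stored))
}
```

With the switch off, the event is still published to the API sink; only the duplicate log line disappears.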

Which issue(s) this PR is related to:

Fixes: #194

Special notes for your reviewer:

I did bare-minimum end-to-end testing and tried to mimic a failure to exercise at least the ClaimPrepareFailed event on a ResourceClaim when the driver cannot prepare a device.

The idea: I create a dummy interface, which gets published in the ResourceSlice as an available device. Then I install the DeviceClass and send SIGSTOP to the driver pod so that it can't process requests; this effectively pauses the pod. Next, I delete the dummy interface while the driver can't process any changes. Finally, I create a workload pod with a ResourceClaim. At this point the scheduler tries to allocate the dummy interface because it is still listed in the ResourceSlice (the driver isn't aware of the deletion and had no chance to rescan the devices, since we paused it intentionally). The kubelet then makes the NodePrepareResources call, and I send SIGCONT to the driver pod so that it starts handling the request. The request fails because the interface was deleted and the device can no longer be found, so the driver returns an error, from which we create an event. Sorry this is long, but I could not come up with a better way to test it.
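
The race this test relies on can be reproduced in a toy model. The sketch below is only an illustration under simplifying assumptions: `node`, `driver`, `rescan`, `snapshot`, and `prepare` are hypothetical names and do not reflect the driver's real code. The `paused` flag plays the role of the SIGSTOP window, and the two error strings are copied from the events shown in this PR.

```go
package main

import "fmt"

// node models the interfaces that actually exist on the host.
type node struct{ links map[string]bool }

// driver models the DRA driver: store is its local device cache, and
// paused mimics the SIGSTOP window in which it cannot rescan.
type driver struct {
	host   *node
	store  map[string]bool
	paused bool
}

// rescan refreshes the local store from the host unless the driver is paused.
func (d *driver) rescan() {
	if d.paused {
		return
	}
	d.store = map[string]bool{}
	for name := range d.host.links {
		d.store[name] = true
	}
}

// snapshot returns what a ResourceSlice would advertise at this moment.
func (d *driver) snapshot() []string {
	names := make([]string, 0, len(d.store))
	for name := range d.store {
		names = append(names, name)
	}
	return names
}

// prepare mimics the prepare path: consult the store (rescanning once on a
// miss), then check that the link really exists on the host.
func (d *driver) prepare(device string) error {
	if !d.store[device] {
		d.rescan()
		if !d.store[device] {
			return fmt.Errorf("failed to get network interface name for device %s: device %s not found in store", device, device)
		}
	}
	if !d.host.links[device] {
		return fmt.Errorf("failed to get netlink to interface %s: Link not found", device)
	}
	return nil
}

// runScenario replays the test steps and returns the two prepare errors.
func runScenario() (first, second error) {
	host := &node{links: map[string]bool{"dummy0": true}}
	d := &driver{host: host}
	d.rescan()
	published := d.snapshot() // dummy0 is advertised as available

	d.paused = true              // SIGSTOP
	delete(host.links, "dummy0") // interface removed behind the driver's back
	d.rescan()                   // no-op while paused, so the store is stale
	allocated := published[0]    // scheduler picks the stale device

	d.paused = false // SIGCONT
	first = d.prepare(allocated)
	d.rescan() // a later rescan drops the device from the store
	second = d.prepare(allocated)
	return first, second
}

func main() {
	first, second := runScenario()
	fmt.Println("first attempt: ", first)
	fmt.Println("second attempt:", second)
}
```

In this model the first attempt fails at the netlink lookup because the store is stale, and the retry fails at the store lookup after the rescan, which matches the order of the two ClaimPrepareFailed messages in the event list.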

Below are some snippets from my test if that helps to review:

kubectl get pods
NAME   READY   STATUS              RESTARTS   AGE
pod1   0/1     ContainerCreating   0          111s
kubectl get events -w

8m17s       Normal    NodeAllocatableEnforced         node/dra-worker2                          Updated Node Allocatable limit across pods
8m15s       Normal    Starting                        node/dra-worker2
8m15s       Normal    RegisteredNode                  node/dra-worker2                          Node dra-worker2 event: Registered Node dra-worker2 in Controller
8m2s        Normal    NodeReady                       node/dra-worker2                          Node dra-worker2 status is now: NodeReady
5m19s       Warning   ClaimPrepareFailed              resourceclaim/dummy-interface-static-ip   failed to get netlink to interface dummy0: Link not found
17s         Warning   ClaimPrepareFailed              resourceclaim/dummy-interface-static-ip   failed to get network interface name for device dummy0: device dummy0 not found in store
5m34s       Normal    Scheduled                       pod/pod1                                  Successfully assigned default/pod1 to dra-worker2
5m19s       Warning   FailedPrepareDynamicResources   pod/pod1                                  Failed to prepare dynamic resources: prepare dynamic resources: NodePrepareResources failed for ResourceClaim dummy-interface-static-ip: claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get netlink to interface dummy0: Link not found
17s         Warning   FailedPrepareDynamicResources   pod/pod1                                  Failed to prepare dynamic resources: prepare dynamic resources: NodePrepareResources failed for ResourceClaim dummy-interface-static-ip: claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get network interface name for device dummy0: device dummy0 not found in store
kubectl describe  resourceClaim dummy-interface-static-ip
...
Events:
  Type     Reason              Age   From     Message
  ----     ------              ----  ----     -------
  Warning  ClaimPrepareFailed  43s   dra.net  failed to get netlink to interface dummy0: Link not found
kubectl describe pod pod1

Events:
  Type     Reason                         Age    From               Message
  ----     ------                         ----   ----               -------
  Normal   Scheduled                      2m22s  default-scheduler  Successfully assigned default/pod1 to dra-worker2
  Warning  FailedPrepareDynamicResources  2m7s   kubelet            Failed to prepare dynamic resources: prepare dynamic resources: NodePrepareResources failed for ResourceClaim dummy-interface-static-ip: claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get netlink to interface dummy0: Link not found
  Warning  FailedPrepareDynamicResources  43s    kubelet            Failed to prepare dynamic resources: prepare dynamic resources: NodePrepareResources failed for ResourceClaim dummy-interface-static-ip: claim c8810e22-6955-4690-911b-3251b027bcd1 contain errors: failed to get network interface name for device dummy0: device dummy0 not found in store

Does this PR introduce a user-facing change?

NONE

This commit adds eventRecorder to the driver and as a starting
point we record and publish only the following failure events:
ClaimPrepareFailed, NetworkDeviceAttachFailed, RDMADeviceAttachFailed,
NetworkDeviceDetachFailed and NetworkDeviceDetachFailed.

Signed-off-by: Feruzjon Muyassarov <feruzjon.muyassarov@est.tech>
@k8s-ci-robot k8s-ci-robot added release-note-none Denotes a PR that doesn't merit a release note. kind/feature Categorizes issue or PR as related to a new feature. labels Jun 5, 2026
@netlify

netlify Bot commented Jun 5, 2026

Deploy Preview for dranet canceled.

Name Link
🔨 Latest commit 3aa6a69
🔍 Latest deploy log https://app.netlify.com/projects/dranet/deploys/6a229862c6e3190008d677ff

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fmuyassarov
Once this PR has been reviewed and has the lgtm label, please assign aojea for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 5, 2026
@fmuyassarov
Member Author

/cc @aojea @gauravkghildiyal



Development

Successfully merging this pull request may close these issues.

dranet has no EventRecorder plumbed today

2 participants