Skip to content

gc interface if it's not exist in podLister#5789

Merged
oilbeater merged 1 commit intomasterfrom
gc-interface
Oct 13, 2025
Merged

gc interface if it's not exist in podLister#5789
oilbeater merged 1 commit intomasterfrom
gc-interface

Conversation

@oilbeater
Copy link
Copy Markdown
Collaborator

Pull Request

What type of this PR

Examples of user facing changes:

  • Bug fixes

Which issue(s) this PR fixes

Fixes #5724

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Oct 13, 2025
@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello @oilbeater, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug related to orphaned OVS interfaces by overhauling the garbage collection mechanism. Instead of relying solely on OVS internal state, the system now actively cross-references OVS interfaces with the Kubernetes podLister to confirm if a corresponding pod still exists. This change ensures that OVS resources are only cleaned up when their associated Kubernetes pods are truly gone, improving resource management and preventing potential issues with stale OVS entries.

Highlights

  • Refactored OVS Interface Garbage Collection: The pull request replaces the previous ovs.CleanLostInterface function with a new controller-side gcInterfaces method to manage the cleanup of Open vSwitch (OVS) interfaces.
  • Kubernetes PodLister Integration: The new gcInterfaces function now actively leverages the Kubernetes podLister to verify the existence of pods. This ensures that OVS interfaces are only cleaned up if their corresponding Kubernetes Pods are truly gone, preventing premature or incorrect deletions.
  • Granular Interface Cleanup Utilities: New utility functions ovs.ListInterfacePodMap and ovs.CleanInterface have been introduced. ListInterfacePodMap retrieves mappings between OVS interfaces and their associated pods, while CleanInterface provides a targeted way to delete a specific OVS port and its Quality of Service (QoS) configuration.
  • Removed Obsolete Code: The old CleanLostInterface function and its associated test cases have been removed, streamlining the codebase and eliminating redundant logic.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@dosubot dosubot bot added the bug Something isn't working label Oct 13, 2025
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a garbage collection mechanism for OVS interfaces that belong to non-existent pods. The implementation is sound, but I've found a couple of critical issues that could lead to panics or compilation errors, along with a medium-severity issue for improving code robustness. My review includes suggestions to fix these issues.

return
}
for iface, pod := range interfacePodMap {
podNamespace, podName := strings.Split(pod, "/")[0], strings.Split(pod, "/")[1]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This line can cause a panic if the pod string from interfacePodMap does not contain a /. This would crash the daemon. It's safer to check the result of strings.Split before accessing elements by index to prevent this.

    		parts := strings.Split(pod, "/")
    		if len(parts) != 2 {
    			klog.Warningf("malformed pod name '%s' for interface %s, skipping", pod, iface)
    			continue
    		}
    		podNamespace, podName := parts[0], parts[1]

}
ifaceName := strings.TrimSpace(parts[0])
var podNamespace, podName string
for externalID := range strings.FieldsSeq(parts[1]) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This loop is incorrect and will cause a compilation error. strings.FieldsSeq is not a standard Go function and appears to be a typo for strings.Fields. Additionally, using range with a single variable on a slice iterates over indices, so externalID would be an int. The loop should be corrected to iterate over the values of the split string.

Suggested change
for externalID := range strings.FieldsSeq(parts[1]) {
for _, externalID := range strings.Fields(parts[1]) {

Comment on lines +259 to 265
if strings.Contains(externalID, "pod_name=") {
podName = strings.TrimPrefix(strings.TrimSpace(externalID), "pod_name=")
}

for _, intf := range interfaceList {
name, errText := strings.Trim(strings.Split(intf, "\n")[0], "\""), strings.Split(intf, "\n")[1]
if strings.Contains(errText, "No such device") {
qosList, err := ovsFind("port", "qos", "name="+name)
if err != nil {
klog.Errorf("failed to find related port %v", err)
return
if strings.Contains(externalID, "pod_namespace=") {
podNamespace = strings.TrimPrefix(strings.TrimSpace(externalID), "pod_namespace=")
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using strings.Contains to check for keys in external_ids is not robust. It can lead to false positives if a key name is a substring of another. It's safer to use strings.HasPrefix to ensure you're matching the exact key. Also, strings.TrimSpace is redundant here since strings.Fields (the corrected function) splits by whitespace and the resulting tokens won't have surrounding spaces.

Suggested change
if strings.Contains(externalID, "pod_name=") {
podName = strings.TrimPrefix(strings.TrimSpace(externalID), "pod_name=")
}
for _, intf := range interfaceList {
name, errText := strings.Trim(strings.Split(intf, "\n")[0], "\""), strings.Split(intf, "\n")[1]
if strings.Contains(errText, "No such device") {
qosList, err := ovsFind("port", "qos", "name="+name)
if err != nil {
klog.Errorf("failed to find related port %v", err)
return
if strings.Contains(externalID, "pod_namespace=") {
podNamespace = strings.TrimPrefix(strings.TrimSpace(externalID), "pod_namespace=")
}
if strings.HasPrefix(externalID, "pod_name=") {
podName = strings.TrimPrefix(externalID, "pod_name=")
}
if strings.HasPrefix(externalID, "pod_namespace=") {
podNamespace = strings.TrimPrefix(externalID, "pod_namespace=")
}

@coveralls
Copy link
Copy Markdown

coveralls commented Oct 13, 2025

Pull Request Test Coverage Report for Build 18459036696

Details

  • 0 of 70 (0.0%) changed or added relevant lines in 2 files are covered.
  • 5 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.02%) to 21.067%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/daemon/controller.go 0 25 0.0%
pkg/ovs/ovs-vsctl.go 0 45 0.0%
Files with Coverage Reduction New Missed Lines %
pkg/ovs/ovn-nb-logical_router_route.go 2 75.94%
pkg/ovs/ovs-vsctl.go 3 52.07%
Totals Coverage Status
Change from base Build 18455311602: -0.02%
Covered Lines: 10705
Relevant Lines: 50813

💛 - Coveralls

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the interface cleanup mechanism in the OVS daemon to use a Kubernetes pod lister for garbage collection instead of cleaning all lost interfaces. It introduces a more targeted approach to clean up OVS interfaces only when the corresponding pods don't exist in the cluster.

  • Replaced the generic CleanLostInterface function with a new gcInterfaces method that checks pod existence before cleanup
  • Added ListInterfacePodMap to map OVS interfaces to their corresponding pods
  • Refactored CleanLostInterface into a reusable CleanInterface function with proper error handling

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
pkg/ovs/ovs-vsctl.go Added ListInterfacePodMap and refactored cleanup logic into CleanInterface
pkg/daemon/controller.go Implemented gcInterfaces method using pod lister for targeted cleanup
pkg/ovs/ovs-vsctl_test.go Removed obsolete test for the old cleanup function
pkg/ovs/ovn-nb-suite_test.go Removed obsolete test case for the old cleanup function

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the final PR Bugbot will review for you during this billing cycle

Your free Bugbot reviews will reset on November 30

Details

You are on the Bugbot Free tier. On this plan, Bugbot will review limited PRs each billing cycle.

To receive Bugbot reviews on all of your PRs, visit the Cursor dashboard to activate Pro and start your 14-day free trial.

continue
}

podNamespace, podName, errText := parts[0], parts[1], parts[2]
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Pod String Parsing Error Affects OVS Cleanup

The gcInterfaces function incorrectly parses the pod string when the error text contains forward slashes. strings.Split truncates the error message, which can cause the "No such device" check to fail and prevent proper cleanup of orphaned OVS interfaces.

Fix in Cursor Fix in Web

klog.Infof("delete lost port %s", name)
output, err := Exec("--if-exists", "--with-iface", "del-port", name)
}
result[ifaceName] = fmt.Sprintf("%s/%s/%s", podNamespace, podName, errText)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Pod Parsing Error Due to Substring Matching

The ListInterfacePodMap function's parsing of external_ids for pod_name and pod_namespace is too broad. Using strings.Contains can match substrings, leading to incorrect TrimPrefix results and potentially empty pod names or namespaces. This creates malformed entries in the returned map, which can cause issues for downstream functions relying on these values.

Fix in Cursor Fix in Web

@oilbeater oilbeater merged commit f2c155c into master Oct 13, 2025
136 of 141 checks passed
@oilbeater oilbeater deleted the gc-interface branch October 13, 2025 09:26
dnugmanov added a commit to dnugmanov/kube-ovn that referenced this pull request Oct 21, 2025
The garbage collector introduced in PR kubeovn#5789 incorrectly identifies
KubeVirt VM interfaces as 'lost' and deletes them, breaking network
connectivity for VMs within ~60 seconds of creation.

Root cause: For KubeVirt VMs, the pod_name in OVS external_ids is set
to the VM name (e.g., 'ubuntu-vm-br'), not the launcher pod name
(e.g., 'virt-launcher-ubuntu-vm-br-xyz'). When gcInterfaces() tries
to look up the pod by the VM name, it fails and incorrectly deletes
the interface.

Solution: When a pod is not found by direct lookup, check if it might
be a KubeVirt VM by searching for launcher pods with the label
'vm.kubevirt.io/name' matching the pod_name from OVS. If a matching
launcher pod exists, keep the interface instead of deleting it.

This fix maintains backward compatibility with non-KubeVirt pods
while preventing false-positive deletions for KubeVirt VMs.

Signed-off-by: Damir Nugmanov <damir_nug@mail.ru>
dnugmanov added a commit to dnugmanov/kube-ovn that referenced this pull request Oct 21, 2025
The garbage collector introduced in PR kubeovn#5789 incorrectly identifies
KubeVirt VM interfaces as 'lost' and deletes them, breaking network
connectivity for VMs within ~60 seconds of creation.

Root cause: For KubeVirt VMs, the pod_name in OVS external_ids is set
to the VM name (e.g., 'ubuntu-vm-br'), not the launcher pod name
(e.g., 'virt-launcher-ubuntu-vm-br-xyz'). When gcInterfaces() tries
to look up the pod by the VM name, it fails and incorrectly deletes
the interface.

Solution: When a pod is not found by direct lookup, check if it might
be a KubeVirt VM by searching for launcher pods with the label
'vm.kubevirt.io/name' matching the pod_name from OVS. If a matching
launcher pod exists, keep the interface instead of deleting it.

This fix maintains backward compatibility with non-KubeVirt pods
while preventing false-positive deletions for KubeVirt VMs.

Signed-off-by: Damir Nugmanov <damir_nug@mail.ru>
Signed-off-by: dnugmanov <damir_nug@mail.ru>
dnugmanov added a commit to dnugmanov/kube-ovn that referenced this pull request Oct 22, 2025
The garbage collector introduced in PR kubeovn#5789 incorrectly identifies
KubeVirt VM interfaces as 'lost' and deletes them, breaking network
connectivity for VMs within ~60 seconds of creation.

Root cause: For KubeVirt VMs, the pod_name in OVS external_ids is set
to the VM name (e.g., 'ubuntu-vm-br'), not the launcher pod name
(e.g., 'virt-launcher-ubuntu-vm-br-xyz'). When gcInterfaces() tries
to look up the pod by the VM name, it fails and incorrectly deletes
the interface.

Solution: When a pod is not found by direct lookup, check if it might
be a KubeVirt VM by searching for launcher pods with the label
'vm.kubevirt.io/name' matching the pod_name from OVS. If a matching
launcher pod exists, keep the interface instead of deleting it.

This fix maintains backward compatibility with non-KubeVirt pods
while preventing false-positive deletions for KubeVirt VMs.

Signed-off-by: dnugmanov <damir_nug@mail.ru>
oilbeater pushed a commit that referenced this pull request Oct 22, 2025
The garbage collector introduced in PR #5789 incorrectly identifies
KubeVirt VM interfaces as 'lost' and deletes them, breaking network
connectivity for VMs within ~60 seconds of creation.

Root cause: For KubeVirt VMs, the pod_name in OVS external_ids is set
to the VM name (e.g., 'ubuntu-vm-br'), not the launcher pod name
(e.g., 'virt-launcher-ubuntu-vm-br-xyz'). When gcInterfaces() tries
to look up the pod by the VM name, it fails and incorrectly deletes
the interface.

Solution: When a pod is not found by direct lookup, check if it might
be a KubeVirt VM by searching for launcher pods with the label
'vm.kubevirt.io/name' matching the pod_name from OVS. If a matching
launcher pod exists, keep the interface instead of deleting it.

This fix maintains backward compatibility with non-KubeVirt pods
while preventing false-positive deletions for KubeVirt VMs.

Signed-off-by: dnugmanov <damir_nug@mail.ru>
oilbeater added a commit that referenced this pull request Dec 1, 2025
Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
@mengyu0829
Copy link
Copy Markdown
Contributor

mengyu0829 commented Dec 4, 2025

您好!我们用这个版本验证发现gcInterface有3个问题,麻烦看看呢,谢谢!

  1. gc前查询interface是逗号拼接(name,external_ids,error),然后逗号分隔,不等于3的直接跳过了,当开启ip双栈,external_ids中会存在(ipv4,ipv6)形式的pod ip,这样按逗号分隔就会出错
1
  1. ListInterfacePodMap中只查询了interface有error的,还有一种情况error为空,但是admin_state和link_state都是down的,查询pod已不存在,也需要gc掉;
2
  1. 另外对statefulset类型的pod重启后,发现所处的chassis和pod实际所在的节点不一致,由于pod名是由sts名字和数字拼接,残留的名字和新建的名字一样,无法通过查询k8s中是否有该名字的pod来判断是否清理。

@oilbeater
Copy link
Copy Markdown
Collaborator Author

@mengyu0829 感谢发现的问题,前两个问题确实存在。第三个问题由于 daemon 在 list 的时候只 watch 和自己节点相关的 pod,如果 pod 已经到了另一个节点理论上本地的 cache 中应该已经不存在这个 pod 了。针对前两个问题不知道能否提一个 pr 来修复一下

@mengyu0829
Copy link
Copy Markdown
Contributor

@mengyu0829 感谢发现的问题,前两个问题确实存在。第三个问题由于 daemon 在 list 的时候只 watch 和自己节点相关的 pod,如果 pod 已经到了另一个节点理论上本地的 cache 中应该已经不存在这个 pod 了。针对前两个问题不知道能否提一个 pr 来修复一下

好的,前两个问题我们修改后验证下,第三个问题之前环境出现过,我再看看能不能复现。

@oilbeater
Copy link
Copy Markdown
Collaborator Author

第三个问题我能想到的一个兜底方式就是 Get 到这个 Pod 后再检查一下 nodeName 是不是和自己一致,如果再出现问题应该就是 client-go 本身机制上的问题了

oilbeater added a commit that referenced this pull request Dec 4, 2025
oilbeater added a commit that referenced this pull request Dec 4, 2025
This reverts commit 8886b91.

Signed-off-by: Mengxin Liu <liumengxinfly@gmail.com>
@mengyu0829
Copy link
Copy Markdown
Contributor

嗯,第三个问题复现了,现象是这样的:

  1. sts类型的pod 升级重启后调度到了poc03节点,状态running,ip和重启前一样,查看sbctl, 显示只绑定在pocnode12节点;
1
  1. 查看poc03节点的interface表,有该pod;
3
  1. 查看pocnode12节点的interface表,也有该pod;
2

@mengyu0829
Copy link
Copy Markdown
Contributor

另外补充下, 直接kubectl delete sts的pod复现不了, 可以eidt sts的副本数 (比如从3改成0,再从0改成3,)或者镜像tag就比较容易出现。

@oilbeater
Copy link
Copy Markdown
Collaborator Author

oilbeater commented Dec 5, 2025

我去复现一下,感觉比较奇怪,sts 缩容如果触发 cni del 事件也应该删除端口才对,不应该等到 gc 才去删除,能不能给个完整的 statefulset yaml 会怀疑和升级策略或者其他参数也有关系

@mengyu0829
Copy link
Copy Markdown
Contributor

我去复现一下,感觉比较奇怪,sts 缩容如果触发 cni del 事件也应该删除端口才对,不应该等到 gc 才去删除,能不能给个完整的 statefulset yaml 会怀疑和升级策略或者其他参数也有关系

test.yaml

@oilbeater
Copy link
Copy Markdown
Collaborator Author

我这里还是无法复现,能否看下在 Statefulset 的 pod 删除和创建过程中对应节点的 kube-ovn-cni 日志,看看是什么原因导致端口没有随着 pod 删除而删除

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working size:L This PR changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] 有状态应用升级后pod网络无法访问,OVS残留脏数据

4 participants