Fix: remove stale flannel.1 before restart k3s by naiming-zededa · Pull Request #5672 · lf-edge/eve

naiming-zededa · 2026-03-13T04:00:54Z

Description

This PR works around a nil-pointer bug in watchVXLANDevice, possibly introduced in flannel v0.27.4. During a k3s transition, flannel crashed with a SIGSEGV:

   [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x3b9216e]

   goroutine 25048 [running]:
   github.com/flannel-io/flannel/pkg/backend/vxlan.(*network).watchVXLANDevice(0xc01e9e95c0, {0x81a0f10, 0xc000a7ec80}, 0xc0169a1730)
       /go/pkg/mod/github.com/flannel-io/flannel@v0.27.4/pkg/backend/vxlan/vxlan_network.go:138 +0x36e
   github.com/flannel-io/flannel/pkg/backend/vxlan.(*network).Run.func2()
       /go/pkg/mod/github.com/flannel-io/flannel@v0.27.4/pkg/backend/vxlan/vxlan_network.go:81 +0x30
   created by github.com/flannel-io/flannel/pkg/backend/vxlan.(*network).Run in goroutine 18390
       /go/pkg/mod/github.com/flannel-io/flannel@v0.27.4/pkg/backend/vxlan/vxlan_network.go:80 +0x23b

This potential fix is a work-around: it removes the stale flannel.1 device before the k3s restart, so that a freshly started flannel does not hit this bug.
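The work-around described above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the code in this PR: the helper names (`removeStaleLink`, `linkExists`, `ipLinkDelete`) are hypothetical, and deleting the device by shelling out to `ip link delete` is an assumed mechanism. The existence check and the delete action are passed in as functions so the logic can be exercised without root privileges.

```go
package main

import (
	"fmt"
	"net"
	"os/exec"
)

// staleVXLANName is the VXLAN device named in the PR title.
const staleVXLANName = "flannel.1"

// linkExists reports whether a network interface with the given name
// is currently present on the host.
func linkExists(name string) bool {
	_, err := net.InterfaceByName(name)
	return err == nil
}

// ipLinkDelete removes a device via iproute2 ("ip link delete <name>").
// Assumption: the ip binary is on PATH and the caller has root privileges.
func ipLinkDelete(name string) error {
	return exec.Command("ip", "link", "delete", name).Run()
}

// removeStaleLink deletes the named device only if it exists.
// It returns true when a device was actually removed.
// (Hypothetical helper; exists and deleteLink are injected for testability.)
func removeStaleLink(name string, exists func(string) bool, deleteLink func(string) error) (bool, error) {
	if !exists(name) {
		// Nothing to clean up; the restart can proceed as usual.
		return false, nil
	}
	if err := deleteLink(name); err != nil {
		return false, fmt.Errorf("delete %s: %w", name, err)
	}
	return true, nil
}
```

A caller would run `removeStaleLink(staleVXLANName, linkExists, ipLinkDelete)` immediately before restarting k3s, and treat an error as a reason to log and continue rather than block the restart (a design choice assumed here, not stated in the PR).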

PR dependencies

How to test and validate this PR

This flannel panic has been seen only twice, so it may not be easy to reproduce.
Testing still needs to exercise multiple cluster transitions (single-node to multi-node, and vice versa) to make sure this patch introduces no issues and the cluster still works.

Changelog notes

Fix: remove stale flannel.1 before restart k3s

PR Backports

Checklist

I've provided a proper description
I've added the proper documentation
I've tested my PR on amd64 device
I've tested my PR on arm64 device
I've written the test verification instructions
I've set the proper labels to this PR

For backport PRs (remove it if it's not a backport):

I've added a reference link to the original PR
PR's title follows the template

And the last but not least:

I've checked the boxes above, or I've provided a good reason why I didn't
check them.

- Work around a nil-pointer bug in watchVXLANDevice, possibly introduced in flannel v0.27.4. During a k3s transition, flannel crashed with a SIGSEGV.
- This potential fix removes the stale flannel.1 before the k3s restart, so that the fresh flannel start won't hit this bug.

Signed-off-by: naiming-zededa <naiming@zededa.com>

codecov · 2026-03-13T04:53:07Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (2281599) to head (8dd0b3b).
⚠️ Report is 346 commits behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #5672      +/-   ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      713     +123     
+ Misses       2310     1552     -758     
- Partials      121      152      +31


rene · 2026-03-16T11:29:29Z

@naiming-zededa, isn't there a fix already available from flannel-io? It would be better to integrate the proper fix.

rene · 2026-03-16T11:29:50Z

@naiming-zededa, here is what Claude says:

The search results mention a relevant PR. Let me look more closely at the watchVXLANDevice crash specifically at line 138 in vxlan_network.go, which matches your stacktrace exactly.

The release notes mention a highly relevant fix. Let me look at PR #2272 more closely: it's about recreating the VXLAN device when the external interface is deleted/re-added, which is very close to your crash in watchVXLANDevice.

I found the fix. Here's the summary:

Found It: PR #2272, Fixed in v0.27.4

Your crash in watchVXLANDevice at vxlan_network.go:138 is directly addressed by:

"Fix: recreate VXLAN device (flannel.*) when external interface is deleted and re-added" — PR #2272 by @pratikjagrut, shipped in v0.27.4

The crash is a nil pointer dereference in watchVXLANDevice — the goroutine watching the VXLAN netlink device hits a nil reference when the underlying external interface disappears (e.g., a network interface is briefly deleted and re-added, which can happen during network reconfiguration, node restarts, or cloud provider interface churn). PR #2272 fixes this by properly recreating the VXLAN device in that scenario rather than letting it dereference a stale/nil pointer.

You're on v0.27.4 (per your stacktrace module path), which means this fix is already included in your version. That tells us one of two things:

1. The fix is incomplete / has a regression: the same code path can still be triggered under certain conditions even after the PR.
2. You're hitting a different trigger of the same nil-dereference at line 138.

Recommended next steps

1. Check if there's a newer release: The current latest is also v0.27.4, so there's no newer patch to upgrade to right now.
2. Check if the issue has been reopened upstream: Search https://github.com/flannel-io/flannel/issues for watchVXLANDevice or vxlan_network.go:138 to see if others are reporting the same crash post-v0.27.4.
3. Workaround: Make your external network interface more stable. Avoid scenarios where the physical/VM NIC flannel is bound to flaps. You can also try pinning flannel to a specific interface with --iface=<name> and ensuring it's always present before flannel starts.
4. As a short-term mitigation: Add a restartPolicy: Always with a backoff limit in your DaemonSet (already default in Kubernetes), so the pod automatically recovers after the crash.

naiming-zededa · 2026-03-16T16:46:40Z

Hey @rene, the k3s version we have, K3s v1.34.2+k3s1, includes flannel v0.27.4. Actually, it is flannel PR 2272 that introduced this crash, but we are not going to downgrade the k3s version (flannel is part of k3s).
And here is what my Claude says:

 Now I can see the full picture. Rene is wrong, and his AI (Claude) gave him a confidently incorrect answer. Here's why:

  The AI's mistake

  Rene's Claude told him PR #2272 fixes the crash at vxlan_network.go:138. But that's backwards: PR #2272 is what INTRODUCED the crash. The watchVXLANDevice function did not exist before PR #2272. It was added by that very PR and shipped in v0.27.4.

  The AI even correctly noted "you're on v0.27.4, which means this fix is already included", and then couldn't reconcile the contradiction that the crash is still happening. It hand-waved with "the fix is incomplete / has a regression" without actually verifying the code.

  The facts

  1. PR #2272 merged → v0.27.4 (Oct 1, 2025). It added the watchVXLANDevice() function and the reCreateVxlan() logic.
  2. K3s v1.34.2+k3s1 uses flannel v0.27.4 — so it includes this new code.
  3. The SIGSEGV at vxlan_network.go:138 is in watchVXLANDevice — code that only exists because of PR #2272.
  4. The bug: nw.dev.link can be nil (the nil check on line 117 only guards nw.dev, not nw.dev.link), and there's no mutex protecting nw.dev, which is written to by reCreateVxlan in a separate goroutine.

  What to tell Rene

  PR #2272 doesn't fix this crash — it causes it. The watchVXLANDevice function introduced by that PR has:
  - An insufficient nil guard (doesn't check nw.dev.link)
  - A data race on nw.dev between the watcher goroutine and the reCreateVxlan goroutine (no synchronization)

  Your workaround (removing stale flannel.1 before restarting k3s) is the right approach for now. The proper upstream fix needs to come from flannel, either in a new PR to v0.28.x or a patch to v0.27.5.
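
The two defects named in the comment above (a nil guard that does not cover the inner link pointer, and unsynchronized access to the device from two goroutines) can be illustrated with a minimal Go sketch. The types below are simplified, hypothetical stand-ins and not flannel's actual code; the claim about what flannel's source does comes from the comment and has not been verified here. The sketch only shows the general pattern: check both pointers, and take a lock around every read and write of the shared field.

```go
package main

import "sync"

// vxlanLink and vxlanDevice are hypothetical stand-ins for the real
// flannel types; only the pointer relationship matters for this sketch.
type vxlanLink struct{ index int }

type vxlanDevice struct{ link *vxlanLink }

// vxlanNetwork holds a device that one goroutine may replace while
// another goroutine reads it, so access is guarded by a mutex.
type vxlanNetwork struct {
	mu  sync.RWMutex
	dev *vxlanDevice
}

// setDev models the recreate path swapping in a new device.
func (n *vxlanNetwork) setDev(d *vxlanDevice) {
	n.mu.Lock()
	defer n.mu.Unlock()
	n.dev = d
}

// linkIndex models the watcher path. It returns ok=false instead of
// dereferencing when either the device or its link is nil.
func (n *vxlanNetwork) linkIndex() (index int, ok bool) {
	n.mu.RLock()
	defer n.mu.RUnlock()
	if n.dev == nil || n.dev.link == nil {
		return 0, false
	}
	return n.dev.link.index, true
}
```

With this shape, a watcher that sees `ok == false` can skip the event or wait for the device to be recreated instead of crashing.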

naiming-zededa · 2026-03-16T16:51:04Z

Also, regardless of whether they fix this crash later, this PR is a protective measure and is good to have anyway.

rene · 2026-03-16T17:00:24Z

Also, regardless of whether they fix this crash later, this PR is a protective measure and is good to have anyway.

Sure, but I was looking for the proper fix.

naiming-zededa requested a review from zedi-pramodh as a code owner March 13, 2026 04:00

github-actions bot requested review from andrewd-zededa and eriknordmark March 13, 2026 04:01

naiming-zededa removed the request for review from eriknordmark March 13, 2026 04:01

rene approved these changes Mar 16, 2026

naiming-zededa wants to merge 1 commit into lf-edge:master from naiming-zededa:naiming-flannel-removal


2 participants
