
Fix: remove stale flannel.1 before restart k3s#5672

Open
naiming-zededa wants to merge 1 commit intolf-edge:masterfrom
naiming-zededa:naiming-flannel-removal

Conversation

@naiming-zededa
Contributor

@naiming-zededa naiming-zededa commented Mar 13, 2026

Description

  • Works around a nil-pointer bug in watchVXLANDevice that was possibly introduced in flannel v0.27.4. During the k3s transition, flannel hit a SIGSEGV:

   [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x3b9216e]

   goroutine 25048 [running]:
   github.com/flannel-io/flannel/pkg/backend/vxlan.(*network).watchVXLANDevice(0xc01e9e95c0, {0x81a0f10, 0xc000a7ec80}, 0xc0169a1730)
       /go/pkg/mod/github.com/flannel-io/flannel@v0.27.4/pkg/backend/vxlan/vxlan_network.go:138 +0x36e
   github.com/flannel-io/flannel/pkg/backend/vxlan.(*network).Run.func2()
       /go/pkg/mod/github.com/flannel-io/flannel@v0.27.4/pkg/backend/vxlan/vxlan_network.go:81 +0x30
   created by github.com/flannel-io/flannel/pkg/backend/vxlan.(*network).Run in goroutine 18390
       /go/pkg/mod/github.com/flannel-io/flannel@v0.27.4/pkg/backend/vxlan/vxlan_network.go:80 +0x23b

   Here is some analysis and a potential work-around. Triggers in
  • This potential fix is a work-around: remove the stale flannel.1 before the k3s restart, so that the fresh flannel start doesn't hit this bug.

PR dependencies

How to test and validate this PR

This flannel panic has been seen only twice, so it may not be easy to reproduce.
Testing needs to exercise multiple cluster transitions (single-node to multi-node, and vice versa) to make sure this patch introduces no issues and the cluster still works fine.
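One possible check to run after each transition, once k3s is back up, is whether the VXLAN device exists again. The sketch below is an assumption about how such a check could look, not part of this PR; `linkExists` is a hypothetical helper built on Go's standard `net.InterfaceByName`.

```go
package main

import (
	"fmt"
	"net"
)

// linkExists reports whether a network interface with the given name
// is currently present on the host.
func linkExists(name string) bool {
	_, err := net.InterfaceByName(name)
	return err == nil
}

func main() {
	// Assumption: with the VXLAN backend, flannel recreates flannel.1
	// after a restart, so it should be present once k3s is healthy.
	fmt.Println("flannel.1 present:", linkExists("flannel.1"))
}
```

A presence check alone does not prove that pod networking works, so it would complement, not replace, the cluster-level validation described above.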

Changelog notes

Fix: remove stale flannel.1 before restart k3s

PR Backports

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

For backport PRs (remove it if it's not a backport):

  • I've added a reference link to the original PR
  • PR's title follows the template

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

- to fix a nil-pointer bug in watchVXLANDevice, possibly introduced in
  flannel v0.27.4. During the k3s transition, flannel hit a SIGSEGV
- this potential fix is to remove the stale flannel.1 before the k3s
  restart, so the fresh flannel start won't hit this bug

Signed-off-by: naiming-zededa <naiming@zededa.com>
@codecov

codecov bot commented Mar 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.49%. Comparing base (2281599) to head (8dd0b3b).
⚠️ Report is 346 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5672      +/-   ##
==========================================
+ Coverage   19.52%   29.49%   +9.96%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      713     +123     
+ Misses       2310     1552     -758     
- Partials      121      152      +31     


@rene
Contributor

rene commented Mar 16, 2026

@naiming-zededa, isn't there a fix already available from flannel-io? It would be better to integrate the proper fix.

@rene
Contributor

rene commented Mar 16, 2026

@naiming-zededa, here is what Claude says:


The search results mention a relevant PR. Let me look more closely at the watchVXLANDevice crash specifically at line 138 in vxlan_network.go, which matches your stacktrace exactly.

The release notes mention a highly relevant fix. Let me look at PR #2272 more closely: it's about recreating the VXLAN device when the external interface is deleted/re-added, which is very close to your crash in watchVXLANDevice.

I found the fix. Here's the summary:


Found It: PR #2272, Fixed in v0.27.4

Your crash in watchVXLANDevice at vxlan_network.go:138 is directly addressed by:

"Fix: recreate VXLAN device (flannel.*) when external interface is deleted and re-added" — PR #2272 by @pratikjagrut, shipped in v0.27.4

The crash is a nil pointer dereference in watchVXLANDevice — the goroutine watching the VXLAN netlink device hits a nil reference when the underlying external interface disappears (e.g., a network interface is briefly deleted and re-added, which can happen during network reconfiguration, node restarts, or cloud provider interface churn). PR #2272 fixes this by properly recreating the VXLAN device in that scenario rather than letting it dereference a stale/nil pointer.

You're on v0.27.4 (per your stacktrace module path), which means this fix is already included in your version. That tells us one of two things:

  1. The fix is incomplete / has a regression — the same code path can still be triggered under certain conditions even after the PR.
  2. You're hitting a different trigger of the same nil-dereference at line 138.

Recommended next steps

  • Check if there's a newer release: The current latest is also v0.27.4, so there's no newer patch to upgrade to right now.
  • Check if the issue has been reopened upstream: Search https://github.com/flannel-io/flannel/issues for watchVXLANDevice or vxlan_network.go:138 to see if others are reporting the same crash post-v0.27.4.
  • Workaround: Make your external network interface more stable — avoid scenarios where the physical/VM NIC flannel is bound to flaps. You can also try pinning flannel to a specific interface with --iface=<name> and ensuring it's always present before flannel starts.
  • As a short-term mitigation: Add a restartPolicy: Always with a backoff limit in your DaemonSet (already default in Kubernetes), so the pod automatically recovers after the crash.

@naiming-zededa
Contributor Author

Hey @rene, the k3s version we have, K3s v1.34.2+k3s1, includes flannel v0.27.4. Actually, it is flannel PR 2272 that introduced this crash, but we are not going to downgrade the k3s version (flannel is part of k3s).
And here is what my Claude says:

  Now I can see the full picture. Rene is wrong, and his AI (Claude) gave him a confidently incorrect answer. Here's why:

  The AI's mistake

  Rene's Claude told him PR #2272 fixes the crash at vxlan_network.go:138. But that's backwards: PR #2272 is what INTRODUCED the crash. The watchVXLANDevice function did not exist before PR #2272. It was added by that very PR and shipped in v0.27.4.

  The AI even correctly noted "you're on v0.27.4, which means this fix is already included", and then couldn't reconcile the contradiction that the crash is still happening. It hand-waved with "the fix is incomplete / has a regression" without actually verifying the code.

  The facts

  1. PR #2272 merged → v0.27.4 (Oct 1, 2025). It added the watchVXLANDevice() function and the reCreateVxlan() logic.
  2. K3s v1.34.2+k3s1 uses flannel v0.27.4 — so it includes this new code.
  3. The SIGSEGV at vxlan_network.go:138 is in watchVXLANDevice — code that only exists because of PR #2272.
  4. The bug: nw.dev.link can be nil (the nil check on line 117 only guards nw.dev, not nw.dev.link), and there's no mutex protecting nw.dev, which is written to by reCreateVxlan in a separate goroutine.

  What to tell Rene

  PR #2272 doesn't fix this crash — it causes it. The watchVXLANDevice function introduced by that PR has:
  - An insufficient nil guard (doesn't check nw.dev.link)
  - A data race on nw.dev between the watcher goroutine and the reCreateVxlan goroutine (no synchronization)

  Your workaround (removing stale flannel.1 before restarting k3s) is the right approach for now. The proper upstream fix needs to come from flannel, either in a new PR to v0.28.x or a patch to v0.27.5.
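To make the two problems named above concrete, here is a minimal, self-contained Go sketch of the pattern. It is not flannel's code and not a proposed upstream patch: the `link`, `device`, and `network` types are hypothetical stand-ins that only borrow the field names mentioned in the analysis. It shows a reader that checks both pointers and a lock shared with the goroutine that swaps the device.

```go
package main

import (
	"fmt"
	"sync"
)

// Hypothetical stand-ins for illustration; these are NOT flannel's types.
type link struct{ index int }

type device struct{ link *link }

type network struct {
	mu  sync.RWMutex // protects dev
	dev *device
}

// setDev is what a recreate path would call to swap in a new device.
func (nw *network) setDev(d *device) {
	nw.mu.Lock()
	defer nw.mu.Unlock()
	nw.dev = d
}

// linkIndex reads the link index, checking both dev and dev.link
// while holding the read lock.
func (nw *network) linkIndex() (int, bool) {
	nw.mu.RLock()
	defer nw.mu.RUnlock()
	if nw.dev == nil || nw.dev.link == nil {
		return 0, false
	}
	return nw.dev.link.index, true
}

func main() {
	// dev is set but its link is still nil: a guard on dev alone
	// would let a dereference of dev.link panic here.
	nw := &network{dev: &device{}}

	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // stands in for the recreate goroutine
		defer wg.Done()
		for i := 1; i <= 1000; i++ {
			nw.setDev(&device{link: &link{index: i}})
		}
	}()
	go func() { // stands in for the watcher goroutine
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			nw.linkIndex()
		}
	}()
	wg.Wait()

	idx, ok := nw.linkIndex()
	fmt.Println(idx, ok) // prints "1000 true"
}
```

Running a sketch like this with `go run -race` is one way to confirm that the shared pointer is accessed only under the lock.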

@naiming-zededa
Contributor Author

Also, regardless of whether they fix this crash later, this PR is a protective measure and is good to have anyway.

@rene
Contributor

rene commented Mar 16, 2026

Also, regardless of whether they fix this crash later, this PR is a protective measure and is good to have anyway.

Sure, but I was looking for the proper fix.

2 participants