Provide manual k3s stop and start option#5694
mrangana wants to merge 1 commit into lf-edge:master
Conversation
Codecov Report ✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##           master    #5694      +/-   ##
==========================================
+ Coverage   19.52%   29.45%   +9.92%
==========================================
  Files          19       18       -1
  Lines        3021     2417     -604
==========================================
+ Hits          590      712     +122
+ Misses       2310     1554     -756
- Partials      121      151      +30
```
mrangana force-pushed from 9e4651f to 41cadb9
eriknordmark
left a comment
The description talks about "users", but I assume that this is for internal debugging purposes. If this is for users, then we need to add something to the EVE API to be able to enable/disable/restart k3s.
So it would be good to clarify the problem it is solving.
For instance, if it is for debugging purposes, then why does the stopped state need to survive a reboot?
I think in this case the "user" is the zks upgrade process; basically, the zks upgrade controller will internally invoke these stop/start states to replace the k3s binary with new versions. Hence this PR's intention is to make sure those flags are set when requested, and also that the regular cluster-init loop does not step on this process. That is what I understand, @mrangana, is that right?
This is needed during the etcd restore of the downstream zks cluster, where the Kubernetes job scheduled to run this operation runs a shell script on the node. The script needs to stop k3s to release the etcd lock and will start k3s again after the restore completes. These restore jobs are orchestrated from zks-server, hence there is no need to add a new EVE API.
pkg/kube/k3s-control.sh (outdated)

```
@@ -0,0 +1,69 @@
#!/bin/sh
#
# Copyright (c) 2024 Zededa, Inc.
```
Please, update the Copyright year.
@mrangana, you need to fix the Yetus issues, sign off your commit, and fix the Copyright year. You can take a look at https://github.com/lf-edge/eve/blob/master/CONTRIBUTING.md
mrangana force-pushed from 41cadb9 to fe9cd41
eriknordmark
left a comment
There are 4 yetus issues in the annotated diffs and in the summary on the https://github.com/lf-edge/eve/actions/runs/23453698145?pr=5694 page.
Please review and fix.
Adds k3s-control.sh, a new script installed as three symlinks (k3s-stop, k3s-start, k3s-status) that allows users to manually control k3s without interfering with EVE's normal restart supervision. This is needed during the etcd restore of the downstream zks cluster, where the Kubernetes job scheduled to run this operation runs a shell script on the node. The script needs to stop k3s to release the etcd lock and will start k3s again after the restore completes. These restore jobs are orchestrated from zks-server, hence there is no need to add a new EVE API.

Signed-off-by: Manjunath Ranganathaiah <manjunath@zededa.com>
mrangana force-pushed from fe9cd41 to 1c94941
Fixed the Yetus and sign-off issues. Regarding the build failure: my local build succeeds with this make command, so it looks like the packages need to be built first.

```shell
make V=1 PRUNE=1 PLATFORM=generic ZARCH=amd64 HV=k pkgs
```
Description
Adds k3s-control.sh, a new script installed as three symlinks (k3s-stop, k3s-start, k3s-status) that allows users to manually control k3s without interfering with EVE's normal restart supervision.
This is needed during the etcd restore of the downstream zks cluster, where the Kubernetes job scheduled to run this operation runs a shell script on the node. The script needs to stop k3s to release the etcd lock and will start k3s again after the restore completes. These restore jobs are orchestrated from zks-server, hence there is no need to add a new EVE API.
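Since the three commands are symlinks to one script, the usual pattern is to dispatch on the invocation name. Below is a minimal sketch of that idea: the flag paths come from the test plan in this PR, but the function bodies, helper logic, and output strings are illustrative assumptions, not the actual k3s-control.sh.

```shell
#!/bin/sh
# Illustrative sketch only: flag paths are from this PR's test plan;
# the function bodies are assumptions, not the real k3s-control.sh.
STOP_FLAG=/var/lib/k3s-stop     # persistent: survives reboots
START_FLAG=/run/kube/k3s-start  # volatile: /run is cleared on boot

do_stop() {
    touch "$STOP_FLAG"                        # tell the supervisor to leave k3s down
    pkill -f "k3s server" 2>/dev/null || true # no-op if k3s is not running
    echo "k3s stopped"
}

do_start() {
    rm -f "$STOP_FLAG"
    mkdir -p "$(dirname "$START_FLAG")"
    touch "$START_FLAG"                       # ask the supervisor to restart now
}

do_status() {
    if pgrep -f "k3s server" >/dev/null 2>&1; then
        echo "Running"
    else
        echo "Stopped"
    fi
    if [ -f "$STOP_FLAG" ]; then
        echo "Stop Flag: Present"
    fi
}

# One script, three names: dispatch on how we were invoked.
case "$(basename "$0")" in
    k3s-stop)   do_stop ;;
    k3s-start)  do_start ;;
    k3s-status) do_status ;;
esac
```

The advantage of this layout is that the stop/start/status logic shares one file, so the flag paths and process matching stay consistent across all three commands.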
How it works:
- k3s-stop creates a persistent stop flag; the cluster-init.sh main loop respects this flag and stops attempting to restart k3s.
- k3s-start removes the stop flag, drops a volatile manual-start flag, resets the exponential backoff counter, and restarts k3s, avoiding unnecessary delay from prior crash backoff.
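The flag handling described above can be sketched roughly as follows. The paths come from the test plan; `supervise_once`, `start_k3s`, and the backoff values are illustrative assumptions, not EVE's actual cluster-init.sh code.

```shell
# Illustrative sketch of the supervisor's flag handling, not EVE's
# actual cluster-init.sh logic.
STOP_FLAG=/var/lib/k3s-stop
START_FLAG=/run/kube/k3s-start
BACKOFF=5        # seconds; doubles on each crash restart (assumed values)
MAX_BACKOFF=300

supervise_once() {
    if [ -f "$STOP_FLAG" ]; then
        return 0                  # manual stop requested: leave k3s down
    fi
    if [ -f "$START_FLAG" ]; then
        BACKOFF=5                 # manual start: forget prior crash backoff
        rm -f "$START_FLAG"       # flag is consumed once acted upon
    fi
    if ! pgrep -f "k3s server" >/dev/null 2>&1; then
        start_k3s                 # hypothetical helper that launches k3s
        BACKOFF=$((BACKOFF * 2))
        if [ "$BACKOFF" -gt "$MAX_BACKOFF" ]; then
            BACKOFF=$MAX_BACKOFF
        fi
    fi
}

# Main loop (illustrative):
#   while true; do supervise_once; sleep "$BACKOFF"; done
```

The key point is the ordering: the stop flag is checked first, so a stopped node never restarts k3s, and the manual-start flag resets the backoff before the restart attempt so a deliberate start is not penalized by earlier crashes.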
Changes:
How to test and validate this PR
Boot EVE in QEMU or on hardware with this image, SSH in, then run these scenarios manually:
A. Basic stop

```shell
k3s-status            # confirm k3s is Running
k3s-stop              # should print "k3s stopped"
k3s-status            # should show: Stopped + Stop Flag: Present
ls /var/lib/k3s-stop  # flag must exist
```
B. Verify supervisor loop respects the stop flag

```shell
# After k3s-stop, wait 30s and confirm k3s stays down
sleep 30
k3s-status             # must still show Stopped
pgrep -f "k3s server"  # must return nothing
```
C. Basic start + backoff reset

```shell
k3s-start
ls /run/kube/k3s-start  # manual-start flag must exist
ls /var/lib/k3s-stop    # stop flag must be gone
# Wait for the cluster-init.sh loop to pick it up (~5s)
sleep 10
k3s-status              # should show Running
```
D. Stop flag survives reboot

```shell
k3s-stop
reboot
# After reboot, before k3s would normally start:
ls /var/lib/k3s-stop  # must still be present
k3s-status            # must show Stopped
```
E. Manual-start flag is cleared on reboot (volatile)

```shell
# /run/kube/k3s-start lives on /run, so it vanishes on reboot.
# After a normal boot (no prior stop), k3s should start automatically.
ls /run/kube/k3s-start  # must NOT exist after clean boot
```
F. Full stop → reboot → start cycle

```shell
k3s-stop
reboot
# Confirm still stopped after reboot
k3s-status
# Now start
k3s-start
sleep 30
k3s-status         # must show Running
kubectl get nodes  # node must be Ready
```
G. Log verification

```shell
grep -E "Manual k3s|stop|start|backoff" /persist/kubelog/k3s-install.log
```

Expect to see entries for each operation with correct timestamps.
Confirm normal EVE operation is unaffected: if neither flag is present, k3s starts and restarts automatically as before.

```shell
# No flags present
ls /var/lib/k3s-stop    # must not exist
ls /run/kube/k3s-start  # must not exist
k3s-status              # Running
# Kill k3s directly and confirm auto-restart
kill $(pgrep -f "k3s server")
sleep 30
k3s-status              # must auto-recover to Running
```
Changelog notes
None
PR Backports
For PRs that should be backported into any stable branch, please add the label stable.
Checklist
Please, check the boxes above after submitting the PR in interactive mode.