
Provide manual k3s stop and start option #5694

Open
mrangana wants to merge 1 commit into lf-edge:master from mrangana:k3s-stop-start

Conversation

Contributor

@mrangana mrangana commented Mar 19, 2026

Description

Adds k3s-control.sh, a new script installed as three symlinks (k3s-stop, k3s-start, k3s-status) that allows users to manually control k3s without interfering with EVE's normal restart supervision.

This is needed during the etcd restore of the downstream zks cluster: the Kubernetes job scheduled to run this operation executes a shell script on the node. The script must stop k3s to release the etcd lock and start it again after the restore completes. These restore jobs are orchestrated from zks-server, so no new EVE API is needed.
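For illustration, the node-side restore script could drive these commands roughly as follows. This is a hypothetical sketch in dry-run form: `restore_etcd` is a stand-in for the actual zks-orchestrated restore step, not a real command.

```shell
#!/bin/sh
# Hypothetical restore-time usage of the new symlinks (sketch only).
# run() just echoes each step so the sketch is safe to execute;
# replace it with "$@" to actually run the commands on a node.
run() { echo "+ $*"; }

run k3s-stop       # set the stop flag and terminate k3s, releasing the etcd lock
run restore_etcd   # placeholder for the zks-orchestrated etcd restore
run k3s-start      # clear the stop flag and signal cluster-init.sh to restart k3s
run k3s-status     # confirm k3s is running again
```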

How it works:

  • k3s-stop sets a persistent stop flag at /var/lib/k3s-stop and terminates all k3s server processes (SIGTERM with SIGKILL fallback). The
    cluster-init.sh main loop respects this flag and stops attempting to restart k3s.
  • k3s-start removes the stop flag and creates a volatile signal flag at /run/kube/k3s-start. The cluster-init.sh loop detects this signal, resets the
    exponential backoff counter, and restarts k3s — avoiding unnecessary delay from prior crash backoff.
  • k3s-status reports whether k3s is running and whether the stop flag is present.
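The flag handling above can be sketched in a few lines of POSIX sh. This is a simplified illustration, not the real pkg/kube/k3s-control.sh: the flags are redirected to a temp directory so it runs anywhere, and dispatch is via a function argument rather than the symlink name.

```shell
#!/bin/sh
# Simplified sketch of the stop/start/status flag logic; the real
# script uses /var/lib/k3s-stop and /run/kube/k3s-start.
DIR="$(mktemp -d)"
STOP_FLAG="$DIR/k3s-stop"     # persistent in the real script (/var/lib)
START_FLAG="$DIR/k3s-start"   # volatile in the real script (/run tmpfs)

k3s_control() {
    case "$1" in
        k3s-stop)  touch "$STOP_FLAG" ;;                       # real script also kills k3s
        k3s-start) rm -f "$STOP_FLAG"; touch "$START_FLAG" ;;  # signals the supervisor
        k3s-status)
            if [ -f "$STOP_FLAG" ]; then
                echo "Stop Flag: Present"
            else
                echo "Stop Flag: Absent"
            fi ;;
    esac
}

k3s_control k3s-stop
k3s_control k3s-status   # prints: Stop Flag: Present
k3s_control k3s-start
k3s_control k3s-status   # prints: Stop Flag: Absent
```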

Changes:

  • pkg/kube/k3s-control.sh — new script; action dispatched via $1 or symlink name
  • pkg/kube/cluster-utils.sh — added K3S_STOP_FLAG and K3S_MANUAL_START_FLAG constants; added terminate_k3s() function
  • pkg/kube/cluster-init.sh — check_start_k3s() gates on stop flag; resets backoff on manual start signal
  • pkg/kube/test_k3s_control.sh — 58-test suite covering stop/start/status, backoff logic, and injection safety
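The supervisor-side gating in check_start_k3s() can be approximated like this. It is a sketch of the behavior described above, not the actual cluster-init.sh code: names other than K3S_STOP_FLAG and K3S_MANUAL_START_FLAG are assumptions, and the flags are redirected to a temp directory so the sketch runs anywhere.

```shell
#!/bin/sh
# Sketch of the cluster-init.sh gating described above (simplified).
DIR="$(mktemp -d)"
K3S_STOP_FLAG="$DIR/k3s-stop"
K3S_MANUAL_START_FLAG="$DIR/k3s-start"
backoff=8   # pretend prior crashes grew the exponential backoff

check_start_k3s() {
    # Never (re)start k3s while the persistent stop flag is present.
    if [ -f "$K3S_STOP_FLAG" ]; then
        return 1
    fi
    # A manual start consumes the volatile signal and resets the
    # backoff so the restart is not delayed by earlier crashes.
    if [ -f "$K3S_MANUAL_START_FLAG" ]; then
        rm -f "$K3S_MANUAL_START_FLAG"
        backoff=1
    fi
    return 0
}

touch "$K3S_STOP_FLAG"
check_start_k3s || echo "gated: not starting k3s"
rm -f "$K3S_STOP_FLAG"
touch "$K3S_MANUAL_START_FLAG"
check_start_k3s && echo "starting k3s, backoff=$backoff"
```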

How to test and validate this PR

Boot EVE in QEMU or on hardware with this image, SSH in, then run these scenarios manually:

A. Basic stop

k3s-status # confirm k3s is Running
k3s-stop # should print "k3s stopped"
k3s-status # should show: Stopped + Stop Flag: Present
ls /var/lib/k3s-stop # flag must exist

B. Verify supervisor loop respects the stop flag

# After k3s-stop, wait 30s and confirm k3s stays down
sleep 30
k3s-status # must still show Stopped
pgrep -f "k3s server" # must return nothing

C. Basic start + backoff reset

k3s-start
ls /run/kube/k3s-start # manual-start flag must exist
ls /var/lib/k3s-stop # stop flag must be gone
# Wait for the cluster-init.sh loop to pick it up (~5s)
sleep 10
k3s-status # should show Running

D. Stop flag survives reboot

k3s-stop
reboot
# After reboot, before k3s would normally start:
ls /var/lib/k3s-stop # must still be present
k3s-status # must show Stopped

E. Manual-start flag is cleared on reboot (volatile)

# /run/kube/k3s-start is on /run — it vanishes on reboot
# After a normal boot (no prior stop), k3s should start automatically
ls /run/kube/k3s-start # must NOT exist after clean boot

F. Full stop → reboot → start cycle

k3s-stop
reboot
# Confirm still stopped after reboot
k3s-status
# Now start
k3s-start
sleep 30
k3s-status # must show Running
kubectl get nodes # node must be Ready

G. Log verification

grep -E "Manual k3s|stop|start|backoff" /persist/kubelog/k3s-install.log
Expect to see entries for each operation with correct timestamps.


H. Regression check

Confirm normal EVE operation is unaffected — if neither flag is present, k3s starts and restarts automatically as before:

# No flags present
ls /var/lib/k3s-stop # must not exist
ls /run/kube/k3s-start # must not exist
k3s-status # Running
# Kill k3s directly and confirm auto-restart
kill $(pgrep -f "k3s server")
sleep 30
k3s-status # must auto-recover to Running

Changelog notes

None

PR Backports

- 16.0-stable: No, as the feature is not available there.
- 14.5-stable: No, as the feature is not available there.
- 13.4-stable: No, as the feature is not available there.

Also, for PRs that should be backported into any stable branch, please
add the label stable.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

@codecov

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.45%. Comparing base (2281599) to head (1c94941).
⚠️ Report is 349 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5694      +/-   ##
==========================================
+ Coverage   19.52%   29.45%   +9.92%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      712     +122     
+ Misses       2310     1554     -756     
- Partials      121      151      +30     


Contributor

@eriknordmark eriknordmark left a comment


The description talks about "users" but I assume that this is for internal debugging purposes. If this is for users, then we need to add something to the EVE API to be able to enable/disable/restart k3s.
So it would be good to clarify the problem it is solving.

For instance, if it is for debugging purposes, then why does the stopped state need to survive a reboot?

@zedi-pramodh

I think in this case the "user" is the zks upgrade process: the zks upgrade controller will internally call these stop/start states to replace the k3s binary with new versions. So the intention of this PR is to make sure those flags are set when requested, and also that the regular cluster-init loop does not step on this process. That is what I understand, @mrangana — is that right?

@mrangana
Contributor Author

This is needed during the etcd restore of the downstream zks cluster: the Kubernetes job scheduled to run this operation executes a shell script on the node. The script must stop k3s to release the etcd lock and start it again after the restore completes. These restore jobs are orchestrated from zks-server, so no new EVE API is needed.

@@ -0,0 +1,69 @@
#!/bin/sh
#
# Copyright (c) 2024 Zededa, Inc.
Contributor


Please, update the Copyright year.

@rene
Contributor

rene commented Mar 23, 2026

@mrangana , you need to fix Yetus issues, Sign-Off your commit and fix the Copyright year.... you can take a look at https://github.com/lf-edge/eve/blob/master/CONTRIBUTING.md

Contributor

@eriknordmark eriknordmark left a comment


There are 4 yetus issues in the annotated diffs and in the summary on the https://github.com/lf-edge/eve/actions/runs/23453698145?pr=5694 page.

Please review and fix.

Adds k3s-control.sh, a new script installed as three symlinks
(k3s-stop, k3s-start, k3s-status) that allows users to manually
control k3s without interfering with EVE's normal restart supervision.

This is needed during the etcd restore of the downstream zks cluster:
the Kubernetes job scheduled to run this operation executes a shell
script on the node. The script must stop k3s to release the etcd lock
and start it again after the restore completes. These restore jobs
are orchestrated from zks-server, so no new EVE API is needed.

Signed-off-by: Manjunath Ranganathaiah <manjunath@zededa.com>
@mrangana
Contributor Author

Fixed the Yetus and sign-off issues. Regarding the build failure, my local build succeeds with these make commands. It looks like the packages need to be built first.

make V=1 PRUNE=1 PLATFORM=generic ZARCH=amd64 HV=k pkgs
make V=1 ROOTFS_VERSION="test" PLATFORM=generic HV=k ZARCH=amd64 eve

Contributor

@eriknordmark eriknordmark left a comment


Run tests
