
Provide manual k3s stop and start option #5694

Open
mrangana wants to merge 1 commit into lf-edge:master from mrangana:k3s-stop-start

Conversation

Contributor

@mrangana mrangana commented Mar 19, 2026

Description

Adds k3s-control.sh, a new script installed as three symlinks (k3s-stop, k3s-start, k3s-status) that allows users to manually control k3s without interfering with EVE's normal restart supervision.

This is needed during the etcd restore of the downstream zks cluster: the Kubernetes job scheduled to run this operation executes a shell script on the node. The script must stop k3s to release the etcd lock and start it again after the restore completes. These restore jobs are orchestrated from zks-server, so no new EVE API is needed.
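For illustration, the node-side restore script could drive these commands roughly as follows. This is a hypothetical sketch in dry-run form: `restore_etcd` is a stand-in for the actual zks-orchestrated restore step, not a real command.

```shell
#!/bin/sh
# Hypothetical restore-time usage of the new symlinks (sketch only).
# run() just echoes each step so the sketch is safe to execute;
# replace it with "$@" to actually run the commands on a node.
run() { echo "+ $*"; }

run k3s-stop       # set the stop flag and terminate k3s, releasing the etcd lock
run restore_etcd   # placeholder for the zks-orchestrated etcd restore
run k3s-start      # clear the stop flag and signal cluster-init.sh to restart k3s
run k3s-status     # confirm k3s is running again
```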

How it works:

  • k3s-stop sets a persistent stop flag at /var/lib/k3s-stop and terminates all k3s server processes (SIGTERM with SIGKILL fallback). The
    cluster-init.sh main loop respects this flag and stops attempting to restart k3s.
  • k3s-start removes the stop flag and creates a volatile signal flag at /run/kube/k3s-start. The cluster-init.sh loop detects this signal, resets the
    exponential backoff counter, and restarts k3s — avoiding unnecessary delay from prior crash backoff.
  • k3s-status reports whether k3s is running and whether the stop flag is present.
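The flag handling above can be sketched in a few lines of POSIX sh. This is a simplified illustration, not the real pkg/kube/k3s-control.sh: the flags are redirected to a temp directory so it runs anywhere, and dispatch is via a function argument rather than the symlink name.

```shell
#!/bin/sh
# Simplified sketch of the stop/start/status flag logic; the real
# script uses /var/lib/k3s-stop and /run/kube/k3s-start.
DIR="$(mktemp -d)"
STOP_FLAG="$DIR/k3s-stop"     # persistent in the real script (/var/lib)
START_FLAG="$DIR/k3s-start"   # volatile in the real script (/run tmpfs)

k3s_control() {
    case "$1" in
        k3s-stop)  touch "$STOP_FLAG" ;;                       # real script also kills k3s
        k3s-start) rm -f "$STOP_FLAG"; touch "$START_FLAG" ;;  # signals the supervisor
        k3s-status)
            if [ -f "$STOP_FLAG" ]; then
                echo "Stop Flag: Present"
            else
                echo "Stop Flag: Absent"
            fi ;;
    esac
}

k3s_control k3s-stop
k3s_control k3s-status   # prints: Stop Flag: Present
k3s_control k3s-start
k3s_control k3s-status   # prints: Stop Flag: Absent
```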

Changes:

  • pkg/kube/k3s-control.sh — new script; action dispatched via $1 or symlink name
  • pkg/kube/cluster-utils.sh — added K3S_STOP_FLAG and K3S_MANUAL_START_FLAG constants; added terminate_k3s() function
  • pkg/kube/cluster-init.sh — check_start_k3s() gates on stop flag; resets backoff on manual start signal
  • pkg/kube/test_k3s_control.sh — 58-test suite covering stop/start/status, backoff logic, and injection safety
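The supervisor-side gating in check_start_k3s() can be approximated like this. It is a sketch of the behavior described above, not the actual cluster-init.sh code: names other than K3S_STOP_FLAG and K3S_MANUAL_START_FLAG are assumptions, and the flags are redirected to a temp directory so the sketch runs anywhere.

```shell
#!/bin/sh
# Sketch of the cluster-init.sh gating described above (simplified).
DIR="$(mktemp -d)"
K3S_STOP_FLAG="$DIR/k3s-stop"
K3S_MANUAL_START_FLAG="$DIR/k3s-start"
backoff=8   # pretend prior crashes grew the exponential backoff

check_start_k3s() {
    # Never (re)start k3s while the persistent stop flag is present.
    if [ -f "$K3S_STOP_FLAG" ]; then
        return 1
    fi
    # A manual start consumes the volatile signal and resets the
    # backoff so the restart is not delayed by earlier crashes.
    if [ -f "$K3S_MANUAL_START_FLAG" ]; then
        rm -f "$K3S_MANUAL_START_FLAG"
        backoff=1
    fi
    return 0
}

touch "$K3S_STOP_FLAG"
check_start_k3s || echo "gated: not starting k3s"
rm -f "$K3S_STOP_FLAG"
touch "$K3S_MANUAL_START_FLAG"
check_start_k3s && echo "starting k3s, backoff=$backoff"
```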

How to test and validate this PR

Boot EVE in QEMU or on hardware with this image, SSH in, then run these scenarios manually:

A. Basic stop

k3s-status # confirm k3s is Running
k3s-stop # should print "k3s stopped"
k3s-status # should show: Stopped + Stop Flag: Present
ls /var/lib/k3s-stop # flag must exist

B. Verify supervisor loop respects the stop flag

# After k3s-stop, wait 30s and confirm k3s stays down
sleep 30
k3s-status # must still show Stopped
pgrep -f "k3s server" # must return nothing

C. Basic start + backoff reset

k3s-start
ls /run/kube/k3s-start # manual-start flag must exist
ls /var/lib/k3s-stop # stop flag must be gone
# Wait for the cluster-init.sh loop to pick it up (~5s)
sleep 10
k3s-status # should show Running

D. Stop flag survives reboot

k3s-stop
reboot
# After reboot, before k3s would normally start:
ls /var/lib/k3s-stop # must still be present
k3s-status # must show Stopped

E. Manual-start flag is cleared on reboot (volatile)

# /run/kube/k3s-start is on /run — it vanishes on reboot
# After a normal boot (no prior stop), k3s should start automatically
ls /run/kube/k3s-start # must NOT exist after clean boot

F. Full stop → reboot → start cycle

k3s-stop
reboot
# Confirm still stopped after reboot
k3s-status
# Now start
k3s-start
sleep 30
k3s-status # must show Running
kubectl get nodes # node must be Ready

G. Log verification

grep -E "Manual k3s|stop|start|backoff" /persist/kubelog/k3s-install.log
Expect to see entries for each operation with correct timestamps.


H. Regression check

Confirm normal EVE operation is unaffected — if neither flag is present, k3s starts and restarts automatically as before:

# No flags present
ls /var/lib/k3s-stop # must not exist
ls /run/kube/k3s-start # must not exist
k3s-status # Running
# Kill k3s directly and confirm auto-restart
kill $(pgrep -f "k3s server")
sleep 30
k3s-status # must auto-recover to Running

Changelog notes

None

PR Backports

- 16.0-stable: No, as the feature is not available there.
- 14.5-stable: No, as the feature is not available there.
- 13.4-stable: No, as the feature is not available there.

Also, for PRs that should be backported into any stable branch, please
add the label stable.

Checklist

  • I've provided a proper description
  • I've added the proper documentation
  • I've tested my PR on amd64 device
  • I've tested my PR on arm64 device
  • I've written the test verification instructions
  • I've set the proper labels to this PR

And the last but not least:

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

@codecov

codecov bot commented Mar 19, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 29.45%. Comparing base (2281599) to head (1c94941).
⚠️ Report is 349 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5694      +/-   ##
==========================================
+ Coverage   19.52%   29.45%   +9.92%     
==========================================
  Files          19       18       -1     
  Lines        3021     2417     -604     
==========================================
+ Hits          590      712     +122     
+ Misses       2310     1554     -756     
- Partials      121      151      +30     


Contributor

@eriknordmark eriknordmark left a comment


The description talks about "users" but I assume that this is for internal debugging purposes. If this is for users, then we need to add something to the EVE API to be able to enable/disable/restart k3s.
So it would be good to clarify the problem it is solving.

For instance, if it is for debugging purposes, then why does the stopped state need to survive a reboot?

@zedi-pramodh

I think in this case the "user" is the zks upgrade process: the zks upgrade controller will internally call these stop/start states to replace the k3s binary with new versions. So the intention of this PR is to make sure those flags are set when requested, and also that the regular cluster-init loop does not step on this process. That is what I understand, @mrangana — is that right?

@mrangana
Contributor Author

This is needed during the etcd restore of the downstream zks cluster: the Kubernetes job scheduled to run this operation executes a shell script on the node. The script must stop k3s to release the etcd lock and start it again after the restore completes. These restore jobs are orchestrated from zks-server, so no new EVE API is needed.

@@ -0,0 +1,69 @@
#!/bin/sh
#
# Copyright (c) 2024 Zededa, Inc.
Contributor


Please, update the Copyright year.

@rene
Contributor

rene commented Mar 23, 2026

@mrangana , you need to fix Yetus issues, Sign-Off your commit and fix the Copyright year.... you can take a look at https://github.com/lf-edge/eve/blob/master/CONTRIBUTING.md

Contributor

@eriknordmark eriknordmark left a comment


There are 4 yetus issues in the annotated diffs and in the summary on the https://github.com/lf-edge/eve/actions/runs/23453698145?pr=5694 page.

Please review and fix.

Adds k3s-control.sh, a new script installed as three symlinks
(k3s-stop, k3s-start, k3s-status) that allows users to manually
control k3s without interfering with EVE's normal restart supervision.

This is needed during the etcd restore of the downstream zks cluster:
the Kubernetes job scheduled to run this operation executes a shell
script on the node. The script must stop k3s to release the etcd lock
and start it again after the restore completes. These restore jobs
are orchestrated from zks-server, so no new EVE API is needed.

Signed-off-by: Manjunath Ranganathaiah <manjunath@zededa.com>
@mrangana
Contributor Author

Fixed the Yetus and sign-off issues. Regarding the build failure, my local build succeeds with these make commands. It looks like the packages need to be built first.

make V=1 PRUNE=1 PLATFORM=generic ZARCH=amd64 HV=k pkgs
make V=1 ROOTFS_VERSION="test" PLATFORM=generic HV=k ZARCH=amd64 eve

Contributor

@eriknordmark eriknordmark left a comment


Run tests
