Documents describing eve-k app deployment and failover by zedi-pramodh · Pull Request #5658 · lf-edge/eve

zedi-pramodh · 2026-03-05T18:53:28Z

These documents were generated using AI tools and seem correct. They cover app deployment on eve-k, failover scenarios in 3-node and tie-breaker modes, and a general overview of cluster-init.sh.

How to test and validate this PR

Nothing to test; this is just documentation.

Changelog notes

None

PR Backports

Checklist

I've provided a proper description
I've added the proper documentation
I've checked the boxes above, or I've provided a good reason why I didn't check them.

Please, check the boxes above after submitting the PR in interactive mode.

Signed-off-by: Pramodh Pallapothu <pramodh@zededa.com>

eriknordmark · 2026-03-05T20:23:16Z

docs/EVE-K-3NODE-FAILOVER.md

+
+### Node roles vs. Tie-Breaker topology
+
+```


Note that there is a markdownlint issue here.

and in several other places throughout the docs....

rene · 2026-03-06T11:26:40Z

docs/EVE-K-3NODE-FAILOVER.md

+  ┌───────────────────────────────────────────────────────────────┐
+  │              3-Node Full Control-Plane Cluster                │
+  │                                                               │
+  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐        │


there are several misaligned | characters throughout the docs....
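One way to catch this kind of misalignment before review is to compare line lengths inside each box diagram. The following is a minimal sketch, not an existing tool in the repository: `misaligned_rows` is a hypothetical helper, and it assumes every character in the diagram is single-width (true for box-drawing characters in a monospace font, not for wide CJK text).

```python
def misaligned_rows(diagram):
    """Return 1-based line numbers whose length differs from the first non-empty line.

    Assumes all characters are single-width, so len() equals the display width.
    """
    rows = [
        (number, line.rstrip())
        for number, line in enumerate(diagram.splitlines(), start=1)
        if line.strip()
    ]
    if not rows:
        return []
    expected = len(rows[0][1])
    return [number for number, line in rows if len(line) != expected]


# A small illustrative box: the third row is one character too wide.
sample = "\n".join([
    "┌" + "─" * 10 + "┐",
    "│" + " Running".ljust(10) + "│",
    "│" + "Terminating" + "│",
    "└" + "─" * 10 + "┘",
])
print(misaligned_rows(sample))  # -> [3]
```

The check is deliberately crude: it only compares total line length, so it flags a shifted right border but not two offsetting errors on the same line.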

rene · 2026-03-06T11:31:22Z

docs/EVE-K-3NODE-FAILOVER.md

+| File | Relevant Scenarios |
+|------|--------------------|
+| `pkg/pillar/types/clustertypes.go` | Config structure, TieBreakerNodeID field |
+| `pkg/pillar/cmd/zedkube/failover.go` | 1, 2 (identical logic to tie-breaker) |


Perhaps it could point only to the major packages instead of single files (e.g. pkg/pillar/cmd/zedkube, pkg/pillar/kubeapi, etc). It can get out of sync with the code very easily as it is now....
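As a rough illustration of that suggestion, the per-file references can be collapsed to their package directories mechanically. This is only a sketch: `package_dirs` is a hypothetical helper, and the input paths are the ones quoted in the tables of this PR.

```python
import posixpath


def package_dirs(paths):
    """Collapse file paths to their unique parent directories, keeping first-seen order."""
    seen = []
    for path in paths:
        directory = posixpath.dirname(path)
        if directory not in seen:
            seen.append(directory)
    return seen


files = [
    "pkg/pillar/types/clustertypes.go",
    "pkg/pillar/cmd/zedkube/failover.go",
    "pkg/pillar/cmd/zedkube/zedkube.go",
    "pkg/pillar/kubeapi/longhorninfo.go",
]
for directory in package_dirs(files):
    print(directory)
```

Package-level references survive file renames and splits inside a package, which is the drift concern raised above.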

Copilot

Pull request overview

Adds new documentation describing EVE-K app deployment, cluster failover behaviors (3-node and tie-breaker topologies), and a detailed flow diagram for pkg/kube/cluster-init.sh.

Changes:

Added an AI-generated, step-by-step flow/flag reference for cluster-init.sh.
Added tie-breaker and full 3-node failover scenario walkthroughs and decision trees.
Added an app deployment microservice/pipeline diagram and pubsub/source-file references.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

| File | Description |
|------|-------------|
| pkg/kube/CLUSTER-INIT-FLOW.md | New detailed flow/flags/reboot summary for `cluster-init.sh` behavior. |
| docs/EVE-K-TIEBREAKER-FAILOVER.md | New failover scenarios for 2+1 (tie-breaker) topology, timeouts, and code references. |
| docs/EVE-K-APP-DEPLOYMENT.md | New end-to-end app deployment flow diagrams, pubsub table, and key file pointers. |
| docs/EVE-K-3NODE-FAILOVER.md | New failover scenarios for full 3-node control-plane topology plus comparisons and config notes. |


Copilot · 2026-03-06T11:33:00Z

docs/EVE-K-TIEBREAKER-FAILOVER.md

+| Parameter | Value | Configurable | Purpose |
+|-----------|-------|-------------|---------|
+| Node NotReady detection | ~60s | No (K8s default) | Mark node unavailable |
+| Pod eviction after NotReady | ~30s | No (K8s default) | Start pod eviction |
+| getKubePodsError threshold | 2 min | No (hardcoded) | Mark app not running |
+| DetachOldWorkload trigger | 2 min | No (hardcoded) | Force volume detach |
+| Lease LeaseDuration | 300s | No | Leader holds lease |
+| Lease RenewDeadline | 180s | No | Renew before expiry |
+| Lease RetryPeriod | 15s | No | Election retry interval |
+| drainSkipK8sAPINotReachableTimeout | 300s | Yes | Skip drain if API down |
+| KubernetesDrainTimeout | 24h | Yes | Max drain wait |
+| Drain cordon retries | 10× / 5s | No | Cordon retry policy |
+| Drain eviction retries | 5× / 300s | No | Drain retry policy |
+


The Timeout Reference table uses || at the start of rows, which will render as an extra empty column in many Markdown renderers. Please switch to a single leading | for each row (including the header and separator).
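Since the same double-pipe remark recurs for several tables in this PR, a quick scan can locate every affected row at once. The sketch below is a hypothetical helper, not a markdownlint rule: `double_pipe_rows` simply reports lines that begin with `||` outside fenced code blocks.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built here to keep this example fence-safe


def double_pipe_rows(markdown):
    """Return 1-based line numbers of lines starting with '||' outside code fences."""
    flagged = []
    in_fence = False
    for number, line in enumerate(markdown.splitlines(), start=1):
        stripped = line.lstrip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence  # entering or leaving a fenced block
            continue
        if not in_fence and stripped.startswith("||"):
            flagged.append(number)
    return flagged


doc = "\n".join([
    "|| Parameter | Value |",
    "||-----------|-------|",
    "| Lease RetryPeriod | 15s |",
    FENCE,
    "|| inside a fence, ignored",
    FENCE,
])
print(double_pipe_rows(doc))  # -> [1, 2]
```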

Copilot · 2026-03-06T11:33:00Z

docs/EVE-K-APP-DEPLOYMENT.md

+| From | To | Message | Contains |
+|------|----|---------|----------|
+| zedagent | zedmanager, zedkube | `AppInstanceConfig` | App spec, volumes, nets, virtualization mode |
+| zedmanager | volumemgr | `VolumeRefConfig` | Volume UUID, size, image ref |
+| zedmanager | zedrouter | `AppNetworkConfig` | Network instance UUIDs, VIF list |
+| zedmanager | domainmgr | `DomainConfig` | vCPUs, RAM, disks (PVC names), VifList, mode |
+| volumemgr | zedmanager | `VolumeRefStatus` | PVC name as `ActiveFileLocation` |
+| zedrouter | zedmanager | `AppNetworkStatus` | VIF assignments, MACs |
+| domainmgr | zedmanager | `DomainStatus` | Running/Pending/Failed, metrics |
+| nim | zedrouter, zedkube | `DeviceNetworkStatus` | Physical port assignments |
+| zedkube | (reporting) | `ENClusterAppStatus` | Cluster-level app health |
+


These tables start with || (double pipe), which typically renders as an extra empty column. Please change the table rows to start with a single | so the Pubsub message table renders correctly in GitHub Markdown.

Copilot · 2026-03-06T11:33:01Z

docs/EVE-K-3NODE-FAILOVER.md

+| Parameter | Value | Configurable | Notes |
+|-----------|-------|-------------|-------|
+| Node NotReady detection | ~60s | No | K8s default |
+| Pod eviction after NotReady | ~30s | No | K8s default |
+| DetachOldWorkload trigger | 2 min | No | Same as tie-breaker |
+| Lease LeaseDuration | 300s | No | Same as tie-breaker |
+| Lease RenewDeadline | 180s | No | 2 candidates race in 3-master |
+| Lease RetryPeriod | 15s | No | Same as tie-breaker |
+| drainSkipK8sAPINotReachableTimeout | 300s | Yes | Same as tie-breaker |
+| KubernetesDrainTimeout | 24h | Yes | Same as tie-breaker |
+| Longhorn rebuild start | ~60-120s | No | After replica offline |
+


The Timeout Reference table rows start with ||, which renders as an extra empty column in many Markdown renderers. Please use a single leading | for the header, separator, and data rows.

Copilot · 2026-03-06T11:33:01Z

docs/EVE-K-3NODE-FAILOVER.md

+// pkg/pillar/types/clustertypes.go
+
+type EdgeNodeClusterConfig struct {
+    ClusterID        uuid.UUID         // Cluster identifier
+    ClusterInterface string            // Network interface for cluster
+    ClusterIPPrefix  net.IPNet         // Cluster IP
+    IsWorkerNode     bool              // This node runs workloads
+    BootstrapNode    bool              // First node to initialize cluster
+    TieBreakerNodeID UUIDandVersion    // UUID of tie-breaker (UNSET = full 3-master)
+    JoinServerIP     net.IP            // Existing node to join
+    EncryptedClusterToken string       // Bootstrap token
+    ClusterContext   string            // Context name in kubeconfig
+}


The EdgeNodeClusterConfig struct shown here does not match the current definition in pkg/pillar/types/clustertypes.go (e.g., ClusterID is UUIDandVersion, ClusterIPPrefix is *net.IPNet, encrypted token fields are represented via CipherToken, and TieBreakerNodeID is UUIDandVersion). Please update the snippet to match the current code or replace it with a link to the source to avoid future drift.

Suggested change

// pkg/pillar/types/clustertypes.go

type EdgeNodeClusterConfig struct {
    ClusterID        uuid.UUID         // Cluster identifier
    ClusterInterface string            // Network interface for cluster
    ClusterIPPrefix  net.IPNet         // Cluster IP
    IsWorkerNode     bool              // This node runs workloads
    BootstrapNode    bool              // First node to initialize cluster
    TieBreakerNodeID UUIDandVersion    // UUID of tie-breaker (UNSET = full 3-master)
    JoinServerIP     net.IP            // Existing node to join
    EncryptedClusterToken string       // Bootstrap token
    ClusterContext   string            // Context name in kubeconfig
}
// For the latest definition of EdgeNodeClusterConfig, see:
// https://github.com/lf-edge/eve/blob/main/pkg/pillar/types/clustertypes.go

Copilot · 2026-03-06T11:33:01Z

docs/EVE-K-APP-DEPLOYMENT.md

+| Agent | Path | Key Function |
+|-------|------|-------------|
+| zedagent | `pkg/pillar/cmd/zedagent/zedagent.go` | Controller config reception |
+| zedmanager | `pkg/pillar/cmd/zedmanager/zedmanager.go` | Orchestration hub |
+| zedmanager | `pkg/pillar/cmd/zedmanager/handledomainmgr.go` | `MaybeAddDomainConfig()` |
+| zedmanager | `pkg/pillar/cmd/zedmanager/handlevolumemgr.go` | Publishes VolumeRefConfig |
+| volumemgr | `pkg/pillar/cmd/volumemgr/blob.go` | PVC creation for HV=k |
+| zedrouter | `pkg/pillar/cmd/zedrouter/zedrouter.go` | Network instance + CNI setup |
+| zedrouter | `pkg/pillar/cmd/zedrouter/cni.go` | eve-bridge RPC server |
+| domainmgr | `pkg/pillar/cmd/domainmgr/domainmgr.go` | HV=k detection, hypervisor call |
+| zedkube | `pkg/pillar/cmd/zedkube/zedkube.go` | KubeVirt API interaction |
+| kubevirt hypervisor | `pkg/pillar/hypervisor/kubevirt.go` | `Setup()`, `CreateReplicaVMIConfig()`, `CreateReplicaPodConfig()` |
+| kubeapi | `pkg/pillar/kubeapi/vitoapiserver.go` | `CreatePVC()`, PVC management |
+| kubeapi | `pkg/pillar/kubeapi/longhorninfo.go` | Longhorn queries |


The "Key Source Files" table also starts each row with || (double pipe), which introduces an empty first column in GitHub Markdown. Please change to a single leading | for proper rendering.

Copilot · 2026-03-06T11:33:02Z

pkg/kube/CLUSTER-INIT-FLOW.md

+            registration-utils.sh
+            utils.sh
+            kubevirt-utils.sh
+            tie-breaker-utils.sh


The sourced-library list is missing vnc-proxy.sh, but cluster-init.sh sources it (. /usr/bin/vnc-proxy.sh). Please add it here so the diagram matches the script's actual dependencies.

Suggested change

-            tie-breaker-utils.sh
+            tie-breaker-utils.sh
+            vnc-proxy.sh — VNC proxy helpers

Copilot · 2026-03-06T11:33:02Z

pkg/kube/CLUSTER-INIT-FLOW.md

+## Section 3c: check_and_run_vnc()
+
+```
+check_and_run_vnc()   (called each main loop iteration)
+       │
+       ├─ VMICONFIG_FILENAME (/run/zedkube/vmiVNC.run) exists
+       │   AND (VNC not running OR process dead)?
+       │     Parse file for VMINAME and VNCPORT
+       │     nohup /usr/bin/virtctl vnc $vmiName -n eve-kube-app
+       │             --port $vmiPort --proxy-only &
+       │     VNC_RUNNING=true
+       │
+       └─ VMICONFIG_FILENAME does NOT exist:
+            VNC_RUNNING==true? → kill virtctl vnc process
+            VNC_RUNNING=false
+```


This section describes check_and_run_vnc() called from the main loop and references /run/zedkube/vmiVNC.run, but cluster-init.sh actually starts monitor_vnc_config & (from vnc-proxy.sh) which watches /run/edgeview/VncParams/vmiVNC.run via inotify. Please update this section (function name, invocation model, and config-file path) to match the current implementation.

Copilot · 2026-03-06T11:33:02Z

pkg/kube/CLUSTER-INIT-FLOW.md

+| Flag File | Meaning |
+|-----------|---------|
+| `/var/lib/all_components_initialized` | All K3s/KubeVirt/Longhorn/Multus installed ✅ |
+| `/var/lib/k3s_installed_unpacked` | K3s binary is available |
+| `/var/lib/edge-node-cluster-mode` | Node is in cluster (not single-node) mode |
+| `/var/lib/multus_initialized` | Multus daemonset applied |
+| `/var/lib/kubevirt_initialized` | KubeVirt + CDI installed |
+| `/var/lib/longhorn_initialized` | Longhorn installed and ready |
+| `/var/lib/debuguser-initialized` | Debug user certificates/roles applied |
+| `/var/lib/node-labels-initialized` | node-uuid and Longhorn labels applied |
+| `/var/lib/base-k3s-mode` | Node is in base K3s mode (no Longhorn/KubeVirt) |
+| `/var/lib/convert-to-single-node` | Pending conversion back to single-node (triggers restore on next boot) |
+| `/var/lib/transition-to-cluster` | Non-bootstrap join in progress (contains timestamp + reboot_count) |
+| `/run/kube/cluster-change-wait-ongoing` | Blocks check_start_k3s during cluster join |
+| `/tmp/cluster_transition_flag` | Blocks check_start_k3s until transition pipe signaled |
+| `/tmp/cluster_transition_pipe$$` | FIFO pipe to coordinate k3s restart after transition |
+


The markdown tables in this section start rows with || (double pipe), which renders as an extra empty column in most Markdown renderers. Please change these table rows to start with a single | so the tables render correctly.

Copilot · 2026-03-06T11:33:03Z

docs/EVE-K-TIEBREAKER-FAILOVER.md

+  │ (leader) │     │  ████    │     │(tie-brkr)│
+  │          │     │  CRASH   │     │          │
+  │ [App-1]  │     │ [App-2]  │     │  (etcd)  │
+  │ Running  │     │Terminatng│     │          │


Typo in the diagram: Terminatng should be Terminating.

Suggested change

-  │ Running  │     │Terminatng│     │          │
+  │ Running  │     │Terminating│     │          │

Copilot · 2026-03-06T11:33:03Z

docs/EVE-K-TIEBREAKER-FAILOVER.md

+          │
+  t=180s  Lease renewal deadline exceeded
+          │  Node B zedkube: attempts to acquire Lease
+          │  Node C: not eligible (tie-breaker, no scheduling)


This states the tie-breaker node is "not eligible" for the zedkube leader election, but the current leader-election code does not exclude tie-breaker/non-worker nodes (it uses a standard LeaseLock in eve-kube-app and any node running zedkube can acquire it). Please adjust the wording to reflect actual behavior, or document the specific mechanism that prevents the tie-breaker from participating (if one exists).

Suggested change

-          │  Node C: not eligible (tie-breaker, no scheduling)
+          │  Node C: tie-breaker only (no workloads; in this scenario it does not participate in this Lease)
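The t=180s step in the quoted timeline lines up with the RenewDeadline value (180s) in the Timeout Reference table. The sketch below checks the three Lease timings from that table against the ordering that client-go's `leaderelection` package enforces when an elector is created. It assumes zedkube uses that package (suggested by the "standard LeaseLock" remark above, but not confirmed here); `lease_timing_problems` is a hypothetical helper.

```python
JITTER_FACTOR = 1.2  # value of leaderelection.JitterFactor in client-go


def lease_timing_problems(lease_duration_s, renew_deadline_s, retry_period_s):
    """Return the ordering rules violated by the given timings (in seconds)."""
    problems = []
    if lease_duration_s <= renew_deadline_s:
        problems.append("LeaseDuration must be greater than RenewDeadline")
    if renew_deadline_s <= JITTER_FACTOR * retry_period_s:
        problems.append("RenewDeadline must be greater than RetryPeriod * JitterFactor")
    return problems


# Values taken from the Timeout Reference table: 300s / 180s / 15s.
print(lease_timing_problems(300, 180, 15))  # -> []
```

With these values, a leader that cannot renew gives up at the 180s RenewDeadline, while other candidates retry every 15s and can only take over once the 300s LeaseDuration has elapsed since the last successful renewal.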

milan-zededa · 2026-03-06T11:38:21Z

docs/EVE-K-APP-DEPLOYMENT.md

+
+---
+
+## Networking Path (zedrouter → Multus → eve-bridge)


Here we could add a link to the more detailed documentation: https://github.com/lf-edge/eve/blob/master/pkg/kube/eve-bridge/README.md

milan-zededa

I don't think these AI-generated diagrams belong under /docs.

The /docs directory is typically used for user-oriented documentation of EVE features, without going too deep into implementation details. For developer-focused material, we usually place Markdown files with implementation details deeper in the repository, for example under pkg/pillar/docs.

Another issue with the current documentation is that it contains only diagrams without any explanation. For someone outside the core EVE team, the diagrams alone don't provide enough context to understand what they represent.

Instead, the documentation under /docs should focus on describing what features EVE-k provides, along with their limitations, configuration, and use-cases.

milan-zededa · 2026-03-06T11:56:40Z

It’s actually quite disappointing that this new and promising EVE-K project currently has only a single short documentation file (https://github.com/lf-edge/eve/blob/master/docs/EVE-K.md) and it doesn’t properly explain what EVE-K actually is or how it can be useful.

And now the proposal is to address that gap with a set of AI-generated diagrams, without any accompanying explanation. That doesn’t really solve the underlying problem, it just adds more content that lacks context and clarity. What the project really needs is clear, structured documentation explaining what EVE-K is, what problems it solves, how it works at a high level, how to configure it, and what its limitations are.

rucoder · 2026-03-06T14:06:57Z

@zedi-pramodh you can tell the AI agent to run the linter locally and fix the comments

andrewd-zededa

Scenario 1's failover sequence may need some more detail

andrewd-zededa · 2026-03-11T16:02:45Z

docs/EVE-K-3NODE-FAILOVER.md

+  t=90s  ReplicaSet evicts App-3 pod (Terminating)
+         │
+  t=120s zedkube (leader, Node A):
+         │  pod terminating >2min → DetachOldWorkload()
+         │  ├─ Remove virt-launcher finalizers
+         │  ├─ Delete PVC attachment on Node C
+         │  └─ Force VMI termination


@zedi-pramodh I think this may be missing a step in between, where the new pod goes to scheduling.

zedi-pramodh requested a review from eriknordmark as a code owner March 5, 2026 18:53

github-actions bot requested review from andrewd-zededa and naiming-zededa March 5, 2026 19:01

eriknordmark reviewed Mar 5, 2026

rene requested review from Copilot March 6, 2026 11:24

Copilot started reviewing on behalf of rene March 6, 2026 11:25

rene reviewed Mar 6, 2026

Copilot AI reviewed Mar 6, 2026

milan-zededa reviewed Mar 6, 2026

Copilot AI reviewed Mar 6, 2026

milan-zededa requested changes Mar 6, 2026

andrewd-zededa reviewed Mar 11, 2026
