Documents describing eve-k app deployment and failover #5658

Open
zedi-pramodh wants to merge 1 commit into lf-edge:master from zedi-pramodh:ai-generated-docs-for-failover

@zedi-pramodh

These documents were generated using AI tools and seem correct. They cover app deployment on eve-k, failover scenarios in 3-node and tie-breaker modes, and a general overview of cluster-init.sh.

How to test and validate this PR

Nothing to test; this PR contains documentation only.

Changelog notes

None

PR Backports

Checklist

  • I've provided a proper description

  • I've added the proper documentation

  • I've checked the boxes above, or I've provided a good reason why I didn't
    check them.

Please, check the boxes above after submitting the PR in interactive mode.

Signed-off-by: Pramodh Pallapothu <pramodh@zededa.com>

### Node roles vs. Tie-Breaker topology

```
Contributor

Note that there is a markdownlint issue here.

Contributor

The same issue appears in several other places throughout the docs.

┌───────────────────────────────────────────────────────────────┐
│ 3-Node Full Control-Plane Cluster │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │

Contributor

There are several misaligned | characters throughout the docs.

| File | Relevant Scenarios |
|------|--------------------|
| `pkg/pillar/types/clustertypes.go` | Config structure, TieBreakerNodeID field |
| `pkg/pillar/cmd/zedkube/failover.go` | 1, 2 (identical logic to tie-breaker) |

Contributor

Perhaps it could point only to the major packages instead of single files (e.g. pkg/pillar/cmd/zedkube, pkg/pillar/kubeapi, etc.). As it is now, it can get out of sync with the code very easily.

Copilot AI left a comment

Pull request overview

Adds new documentation describing EVE-K app deployment, cluster failover behaviors (3-node and tie-breaker topologies), and a detailed flow diagram for pkg/kube/cluster-init.sh.

Changes:

  • Added an AI-generated, step-by-step flow/flag reference for cluster-init.sh.
  • Added tie-breaker and full 3-node failover scenario walkthroughs and decision trees.
  • Added an app deployment microservice/pipeline diagram and pubsub/source-file references.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 12 comments.

| File | Description |
|------|-------------|
| pkg/kube/CLUSTER-INIT-FLOW.md | New detailed flow/flags/reboot summary for cluster-init.sh behavior. |
| docs/EVE-K-TIEBREAKER-FAILOVER.md | New failover scenarios for 2+1 (tie-breaker) topology, timeouts, and code references. |
| docs/EVE-K-APP-DEPLOYMENT.md | New end-to-end app deployment flow diagrams, pubsub table, and key file pointers. |
| docs/EVE-K-3NODE-FAILOVER.md | New failover scenarios for full 3-node control-plane topology plus comparisons and config notes. |

Comment on lines +678 to +691
| Parameter | Value | Configurable | Purpose |
|-----------|-------|-------------|---------|
| Node NotReady detection | ~60s | No (K8s default) | Mark node unavailable |
| Pod eviction after NotReady | ~30s | No (K8s default) | Start pod eviction |
| getKubePodsError threshold | 2 min | No (hardcoded) | Mark app not running |
| DetachOldWorkload trigger | 2 min | No (hardcoded) | Force volume detach |
| Lease LeaseDuration | 300s | No | Leader holds lease |
| Lease RenewDeadline | 180s | No | Renew before expiry |
| Lease RetryPeriod | 15s | No | Election retry interval |
| drainSkipK8sAPINotReachableTimeout | 300s | Yes | Skip drain if API down |
| KubernetesDrainTimeout | 24h | Yes | Max drain wait |
| Drain cordon retries | 10× / 5s | No | Cordon retry policy |
| Drain eviction retries | 5× / 300s | No | Drain retry policy |

Copilot AI Mar 6, 2026

The Timeout Reference table uses || at the start of rows, which will render as an extra empty column in many Markdown renderers. Please switch to a single leading | for each row (including the header and separator).
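
If the rows really do begin with a doubled pipe, the fix is mechanical. The following is a minimal sketch of that normalization, not a tool from the EVE repository; `normalizeRow` is a hypothetical helper, and it assumes no table intentionally starts with an empty first cell.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeRow collapses a doubled leading pipe ("||") into a single "|".
// Rows that already start with one pipe, and non-table lines, are returned
// unchanged. Leading indentation is dropped only on rows that are rewritten.
// Hypothetical helper for illustration; assumes no intentional empty first cell.
func normalizeRow(line string) string {
	trimmed := strings.TrimLeft(line, " \t")
	if strings.HasPrefix(trimmed, "||") {
		return "|" + strings.TrimLeft(trimmed, "|")
	}
	return line
}

func main() {
	rows := []string{
		"|| Parameter | Value |",
		"||-----------|-------|",
		"| Lease RetryPeriod | 15s |",
	}
	for _, r := range rows {
		fmt.Println(normalizeRow(r))
	}
}
```

The same check applies to the header, separator, and data rows, so one pass over each file would cover every table flagged in this review.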

Copilot uses AI. Check for mistakes.
Comment on lines +183 to +194
| From | To | Message | Contains |
|------|----|---------|----------|
| zedagent | zedmanager, zedkube | `AppInstanceConfig` | App spec, volumes, nets, virtualization mode |
| zedmanager | volumemgr | `VolumeRefConfig` | Volume UUID, size, image ref |
| zedmanager | zedrouter | `AppNetworkConfig` | Network instance UUIDs, VIF list |
| zedmanager | domainmgr | `DomainConfig` | vCPUs, RAM, disks (PVC names), VifList, mode |
| volumemgr | zedmanager | `VolumeRefStatus` | PVC name as `ActiveFileLocation` |
| zedrouter | zedmanager | `AppNetworkStatus` | VIF assignments, MACs |
| domainmgr | zedmanager | `DomainStatus` | Running/Pending/Failed, metrics |
| nim | zedrouter, zedkube | `DeviceNetworkStatus` | Physical port assignments |
| zedkube | (reporting) | `ENClusterAppStatus` | Cluster-level app health |

Copilot AI Mar 6, 2026

These tables start with || (double pipe), which typically renders as an extra empty column. Please change the table rows to start with a single | so the Pubsub message table renders correctly in GitHub Markdown.

Comment on lines +601 to +612
| Parameter | Value | Configurable | Notes |
|-----------|-------|-------------|-------|
| Node NotReady detection | ~60s | No | K8s default |
| Pod eviction after NotReady | ~30s | No | K8s default |
| DetachOldWorkload trigger | 2 min | No | Same as tie-breaker |
| Lease LeaseDuration | 300s | No | Same as tie-breaker |
| Lease RenewDeadline | 180s | No | 2 candidates race in 3-master |
| Lease RetryPeriod | 15s | No | Same as tie-breaker |
| drainSkipK8sAPINotReachableTimeout | 300s | Yes | Same as tie-breaker |
| KubernetesDrainTimeout | 24h | Yes | Same as tie-breaker |
| Longhorn rebuild start | ~60-120s | No | After replica offline |

Copilot AI Mar 6, 2026

The Timeout Reference table rows start with ||, which renders as an extra empty column in many Markdown renderers. Please use a single leading | for the header, separator, and data rows.

Comment on lines +575 to +587
// pkg/pillar/types/clustertypes.go

type EdgeNodeClusterConfig struct {
    ClusterID             uuid.UUID      // Cluster identifier
    ClusterInterface      string         // Network interface for cluster
    ClusterIPPrefix       net.IPNet      // Cluster IP
    IsWorkerNode          bool           // This node runs workloads
    BootstrapNode         bool           // First node to initialize cluster
    TieBreakerNodeID      UUIDandVersion // UUID of tie-breaker (UNSET = full 3-master)
    JoinServerIP          net.IP         // Existing node to join
    EncryptedClusterToken string         // Bootstrap token
    ClusterContext        string         // Context name in kubeconfig
}

Copilot AI Mar 6, 2026

The EdgeNodeClusterConfig struct shown here does not match the current definition in pkg/pillar/types/clustertypes.go (e.g., ClusterID is UUIDandVersion, ClusterIPPrefix is *net.IPNet, encrypted token fields are represented via CipherToken, and TieBreakerNodeID is UUIDandVersion). Please update the snippet to match the current code or replace it with a link to the source to avoid future drift.

Suggested change
- // pkg/pillar/types/clustertypes.go
- type EdgeNodeClusterConfig struct {
-     ClusterID             uuid.UUID      // Cluster identifier
-     ClusterInterface      string         // Network interface for cluster
-     ClusterIPPrefix       net.IPNet      // Cluster IP
-     IsWorkerNode          bool           // This node runs workloads
-     BootstrapNode         bool           // First node to initialize cluster
-     TieBreakerNodeID      UUIDandVersion // UUID of tie-breaker (UNSET = full 3-master)
-     JoinServerIP          net.IP         // Existing node to join
-     EncryptedClusterToken string         // Bootstrap token
-     ClusterContext        string         // Context name in kubeconfig
- }
+ // For the latest definition of EdgeNodeClusterConfig, see:
+ // https://github.com/lf-edge/eve/blob/main/pkg/pillar/types/clustertypes.go
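
Whichever form the snippet takes, the one behavior its comments describe is that an unset TieBreakerNodeID means the full 3-node control-plane topology. A minimal sketch of that rule follows, using a simplified hypothetical stand-in (`clusterConfig`, with a plain string ID where an empty string models "unset") rather than the real EdgeNodeClusterConfig and UUIDandVersion types.

```go
package main

import "fmt"

// clusterConfig is a hypothetical, simplified stand-in for
// EdgeNodeClusterConfig; it is not the real type.
type clusterConfig struct {
	TieBreakerNodeID string // empty string models the "UNSET" case
}

type topology string

const (
	fullThreeNode topology = "3-node full control-plane"
	tieBreaker    topology = "2+1 tie-breaker"
)

// topologyOf applies the rule from the snippet's comment:
// no tie-breaker ID configured means the full 3-node topology.
func topologyOf(cfg clusterConfig) topology {
	if cfg.TieBreakerNodeID == "" {
		return fullThreeNode
	}
	return tieBreaker
}

// isTieBreaker reports whether nodeID is the configured tie-breaker node.
func isTieBreaker(cfg clusterConfig, nodeID string) bool {
	return cfg.TieBreakerNodeID != "" && cfg.TieBreakerNodeID == nodeID
}

func main() {
	cfg := clusterConfig{TieBreakerNodeID: "node-c"}
	fmt.Println(topologyOf(cfg), isTieBreaker(cfg, "node-c"))
	fmt.Println(topologyOf(clusterConfig{}))
}
```

The node IDs here are placeholders; in the real struct the comparison would be made on the UUID carried by the configured field.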

Comment on lines +199 to +212
| Agent | Path | Key Function |
|-------|------|-------------|
| zedagent | `pkg/pillar/cmd/zedagent/zedagent.go` | Controller config reception |
| zedmanager | `pkg/pillar/cmd/zedmanager/zedmanager.go` | Orchestration hub |
| zedmanager | `pkg/pillar/cmd/zedmanager/handledomainmgr.go` | `MaybeAddDomainConfig()` |
| zedmanager | `pkg/pillar/cmd/zedmanager/handlevolumemgr.go` | Publishes VolumeRefConfig |
| volumemgr | `pkg/pillar/cmd/volumemgr/blob.go` | PVC creation for HV=k |
| zedrouter | `pkg/pillar/cmd/zedrouter/zedrouter.go` | Network instance + CNI setup |
| zedrouter | `pkg/pillar/cmd/zedrouter/cni.go` | eve-bridge RPC server |
| domainmgr | `pkg/pillar/cmd/domainmgr/domainmgr.go` | HV=k detection, hypervisor call |
| zedkube | `pkg/pillar/cmd/zedkube/zedkube.go` | KubeVirt API interaction |
| kubevirt hypervisor | `pkg/pillar/hypervisor/kubevirt.go` | `Setup()`, `CreateReplicaVMIConfig()`, `CreateReplicaPodConfig()` |
| kubeapi | `pkg/pillar/kubeapi/vitoapiserver.go` | `CreatePVC()`, PVC management |
| kubeapi | `pkg/pillar/kubeapi/longhorninfo.go` | Longhorn queries |

Copilot AI Mar 6, 2026

The "Key Source Files" table also starts each row with || (double pipe), which introduces an empty first column in GitHub Markdown. Please change to a single leading | for proper rendering.

registration-utils.sh
utils.sh
kubevirt-utils.sh
tie-breaker-utils.sh

Copilot AI Mar 6, 2026

The sourced-library list is missing vnc-proxy.sh, but cluster-init.sh sources it (. /usr/bin/vnc-proxy.sh). Please add it here so the diagram matches the script's actual dependencies.

Suggested change
- tie-breaker-utils.sh
+ tie-breaker-utils.sh
+ vnc-proxy.sh — VNC proxy helpers

Comment on lines +471 to +486
## Section 3c: check_and_run_vnc()

```
check_and_run_vnc() (called each main loop iteration)
├─ VMICONFIG_FILENAME (/run/zedkube/vmiVNC.run) exists
│  AND (VNC not running OR process dead)?
│    Parse file for VMINAME and VNCPORT
│    nohup /usr/bin/virtctl vnc $vmiName -n eve-kube-app
│      --port $vmiPort --proxy-only &
│    VNC_RUNNING=true
└─ VMICONFIG_FILENAME does NOT exist:
     VNC_RUNNING==true? → kill virtctl vnc process
     VNC_RUNNING=false
```

Copilot AI Mar 6, 2026

This section describes check_and_run_vnc() called from the main loop and references /run/zedkube/vmiVNC.run, but cluster-init.sh actually starts monitor_vnc_config & (from vnc-proxy.sh) which watches /run/edgeview/VncParams/vmiVNC.run via inotify. Please update this section (function name, invocation model, and config-file path) to match the current implementation.

Comment on lines +660 to +676
| Flag File | Meaning |
|-----------|---------|
| `/var/lib/all_components_initialized` | All K3s/KubeVirt/Longhorn/Multus installed ✅ |
| `/var/lib/k3s_installed_unpacked` | K3s binary is available |
| `/var/lib/edge-node-cluster-mode` | Node is in cluster (not single-node) mode |
| `/var/lib/multus_initialized` | Multus daemonset applied |
| `/var/lib/kubevirt_initialized` | KubeVirt + CDI installed |
| `/var/lib/longhorn_initialized` | Longhorn installed and ready |
| `/var/lib/debuguser-initialized` | Debug user certificates/roles applied |
| `/var/lib/node-labels-initialized` | node-uuid and Longhorn labels applied |
| `/var/lib/base-k3s-mode` | Node is in base K3s mode (no Longhorn/KubeVirt) |
| `/var/lib/convert-to-single-node` | Pending conversion back to single-node (triggers restore on next boot) |
| `/var/lib/transition-to-cluster` | Non-bootstrap join in progress (contains timestamp + reboot_count) |
| `/run/kube/cluster-change-wait-ongoing` | Blocks check_start_k3s during cluster join |
| `/tmp/cluster_transition_flag` | Blocks check_start_k3s until transition pipe signaled |
| `/tmp/cluster_transition_pipe$$` | FIFO pipe to coordinate k3s restart after transition |

Copilot AI Mar 6, 2026

The markdown tables in this section start rows with || (double pipe), which renders as an extra empty column in most Markdown renderers. Please change these table rows to start with a single | so the tables render correctly.

│ (leader) │ │ ████ │ │(tie-brkr)│
│ │ │ CRASH │ │ │
│ [App-1] │ │ [App-2] │ │ (etcd) │
│ Running │ │Terminatng│ │ │

Copilot AI Mar 6, 2026

Typo in the diagram: Terminatng should be Terminating.

Suggested change
- │ Running │ │Terminatng│ │ │
+ │ Running │ │Terminating│ │ │

t=180s Lease renewal deadline exceeded
│ Node B zedkube: attempts to acquire Lease
│ Node C: not eligible (tie-breaker, no scheduling)

Copilot AI Mar 6, 2026

This states the tie-breaker node is "not eligible" for the zedkube leader election, but the current leader-election code does not exclude tie-breaker/non-worker nodes (it uses a standard LeaseLock in eve-kube-app and any node running zedkube can acquire it). Please adjust the wording to reflect actual behavior, or document the specific mechanism that prevents the tie-breaker from participating (if one exists).

Suggested change
- │ Node C: not eligible (tie-breaker, no scheduling)
+ │ Node C: tie-breaker only (no workloads; in this scenario it does not participate in this Lease)


---

## Networking Path (zedrouter → Multus → eve-bridge)

Contributor

Here we could add a link to more detailed documentation: https://github.com/lf-edge/eve/blob/master/pkg/kube/eve-bridge/README.md

Contributor

@milan-zededa milan-zededa left a comment

I don't think these AI-generated diagrams belong under /docs.

The /docs directory is typically used for user-oriented documentation of EVE features, without going too deep into implementation details. For developer-focused material, we usually place Markdown files with implementation details deeper in the repository, for example under pkg/pillar/docs.

Another issue with the current documentation is that it contains only diagrams without any explanation. For someone outside the core EVE team, the diagrams alone don't provide enough context to understand what they represent.

Instead, the documentation under /docs should focus on describing what features EVE-k provides, along with their limitations, configuration, and use-cases.

@milan-zededa
Contributor

It’s actually quite disappointing that this new and promising EVE-K project currently has only a single short documentation file (https://github.com/lf-edge/eve/blob/master/docs/EVE-K.md) and it doesn’t properly explain what EVE-K actually is or how it can be useful.

And now the proposal is to address that gap with a set of AI-generated diagrams, without any accompanying explanation. That doesn’t really solve the underlying problem; it just adds more content that lacks context and clarity. What the project really needs is clear, structured documentation explaining what EVE-K is, what problems it solves, how it works at a high level, how to configure it, and what its limitations are.

@rucoder
Contributor

rucoder commented Mar 6, 2026

@zedi-pramodh you can tell the AI agent to run the linter locally and fix the comments.

Contributor

@andrewd-zededa andrewd-zededa left a comment

Scenario 1's failover sequence may need some more detail.

Comment on lines +82 to +88
t=90s ReplicaSet evicts App-3 pod (Terminating)
t=120s zedkube (leader, Node A):
│ pod terminating >2min → DetachOldWorkload()
│ ├─ Remove virt-launcher finalizers
│ ├─ Delete PVC attachment on Node C
│ └─ Force VMI termination

Contributor

@zedi-pramodh I think this may be missing a step in between, where the new pod goes to scheduling.
