Switch IPPool agent to Deployment with rolling updates #78
Conversation
Leader election between pods has not yet been implemented, so the replica count must remain at 1, but this already solves our problem of an agent running on a node that crashes.
IPPool agent: detailed change description
This document explains the changes introduced in the latest commit. Contents:
- Context and goal
- Main changes (overview)
- API and CRD
- IPPool controller: flow and logic
- Deployment spec: construction details
- RBAC
- Codegen and generated controllers
- Tests
- Compatibility and operational notes
- Main touched files
IPPool Agent Deployment With Rolling Updates
This enhancement replaces the per-IPPool agent Pod with a Deployment to
fix a crash recovery gap: when a node crashes, the single agent Pod can
remain stuck in "Terminating" on that node, preventing rescheduling and
causing a DHCP service disruption. A Deployment removes the single-name
constraint and allows a new agent Pod to be scheduled on a healthy node
while the old Pod is still terminating.
Summary
This enhancement migrates IPPool agent management from a standalone Pod to
an apps/v1 Deployment with a RollingUpdate strategy. The change is driven
by a critical reliability issue in the current design: after a node crash,
the agent Pod can stay in a stuck Terminating state on the failed node,
blocking rescheduling and requiring manual cleanup. By managing the agent
through a Deployment, the controller can create a replacement Pod on a
healthy node even if the old Pod is stuck, while also tracking the
Deployment in status.agentDeploymentRef and reporting readiness based on
ObservedGeneration, UpdatedReplicas, and AvailableReplicas.
Related Issues
harvester/harvester#8205
Motivation
The previous agent management via single Pods had a critical flaw: in the
event of a node crash, the agent Pod would get stuck in a "Terminating"
state on that node. This prevented it from being rescheduled onto a
healthy node without manual intervention, causing a service disruption.
A Deployment provides the ability to create a replacement Pod when the
current one is unavailable, and gives clearer readiness signals for the
controller.
Goals
- Reschedule the agent onto a healthy node after a node crash without manual cleanup.
- Roll out agent updates without DHCP downtime (maxUnavailable=0, maxSurge=1).
- Report agent readiness from Deployment status (observed generation, updated and available replicas).
Non-goals [optional]
Proposal
Create an apps/v1 Deployment for each IPPool agent and reconcile it in
place. The controller updates Deployment template fields when the desired
agent configuration changes and reports readiness through IPPool
conditions. If the Deployment reference is missing or stale, the
controller recreates it and updates status.agentDeploymentRef. The
Deployment allows a new Pod to be scheduled even when the old Pod is
stuck terminating after a node crash.
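A minimal sketch of this reconcile step against the apps/v1 API using client-go; the function name ensureAgentDeployment and the way the desired object is passed in are illustrative assumptions, not the controller's actual code:

```go
package agent

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// ensureAgentDeployment creates the agent Deployment if it is missing (for
// example after the recorded reference went stale) and otherwise reconciles
// the mutable spec fields in place. The caller refreshes
// status.agentDeploymentRef (namespace, name, image, UID) from the returned
// object.
func ensureAgentDeployment(ctx context.Context, c kubernetes.Interface, desired *appsv1.Deployment) (*appsv1.Deployment, error) {
	existing, err := c.AppsV1().Deployments(desired.Namespace).Get(ctx, desired.Name, metav1.GetOptions{})
	if apierrors.IsNotFound(err) {
		return c.AppsV1().Deployments(desired.Namespace).Create(ctx, desired, metav1.CreateOptions{})
	}
	if err != nil {
		return nil, err
	}
	// Update replicas, strategy, and Pod template; the Deployment controller
	// then rolls the agent Pods according to the RollingUpdate strategy.
	existing.Spec.Replicas = desired.Spec.Replicas
	existing.Spec.Strategy = desired.Spec.Strategy
	existing.Spec.Template = desired.Spec.Template
	return c.AppsV1().Deployments(existing.Namespace).Update(ctx, existing, metav1.UpdateOptions{})
}
```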
User Stories
Story 1
Before: A node crash leaves the single agent Pod stuck in Terminating, and
the agent cannot be rescheduled without manual intervention, disrupting
DHCP service.
After: The Deployment creates a replacement Pod on a
healthy node, restoring service automatically.
API changes
Adds status.agentDeploymentRef (namespace, name, image, UID) to the IPPool CRD.
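As a rough sketch of what the addition could look like in Go types (field and type names here are assumptions for illustration; the actual definition lives in the project's API package):

```go
package v1alpha1

import "k8s.io/apimachinery/pkg/types"

// AgentDeploymentRef identifies the Deployment that runs the DHCP agent for
// an IPPool, so the controller can detect a missing or stale object.
type AgentDeploymentRef struct {
	Namespace string    `json:"namespace"`
	Name      string    `json:"name"`
	Image     string    `json:"image"`
	UID       types.UID `json:"uid"`
}

// IPPoolStatus gains the new reference alongside its existing fields.
type IPPoolStatus struct {
	// ... existing status fields ...
	AgentDeploymentRef *AgentDeploymentRef `json:"agentDeploymentRef,omitempty"`
}
```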
Design
Implementation Overview
- Creates and reconciles an apps/v1 Deployment per IPPool agent, updating template fields and tracking UID/image in status.agentDeploymentRef.
- Sets the AgentReady condition based on Deployment readiness (see the readiness sketch below).
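One way to compute that readiness from the Deployment's status, using the signals named in the Summary (a sketch; the function name and the single-replica default are assumptions):

```go
package agent

import appsv1 "k8s.io/api/apps/v1"

// agentDeploymentReady reports whether the agent Deployment has converged:
// the Deployment controller has observed the latest spec and every desired
// replica is both updated and available.
func agentDeploymentReady(d *appsv1.Deployment) bool {
	if d.Status.ObservedGeneration < d.Generation {
		return false // latest spec change not yet observed
	}
	desired := int32(1) // the agent currently runs a single replica
	if d.Spec.Replicas != nil {
		desired = *d.Spec.Replicas
	}
	return d.Status.UpdatedReplicas >= desired && d.Status.AvailableReplicas >= desired
}
```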
Test plan
Create an IPPool and verify the agent Deployment and its status
reference, simulate a node failure, and confirm the replacement Pod is
scheduled and conditions recover.
Upgrade strategy
After upgrade, the controller reconciles each existing IPPool, creates the
agent Deployment, and updates status with agentDeploymentRef. It replaces
the previously managed standalone agents; the Deployment becomes the single
managed resource going forward and ensures replacement Pods can be scheduled
on node failure.
Note [optional]
The rolling update strategy uses maxUnavailable=0 and maxSurge=1 to keep the
DHCP agent available during image changes and node recovery.
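A minimal sketch of that strategy expressed with the apps/v1 Go types (the helper name is illustrative, not taken from the controller's code):

```go
package agent

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// agentUpdateStrategy never removes the old agent Pod first (maxUnavailable=0)
// and lets one replacement Pod come up alongside it (maxSurge=1), so DHCP
// service stays available during image changes and node recovery.
func agentUpdateStrategy() appsv1.DeploymentStrategy {
	maxUnavailable := intstr.FromInt(0)
	maxSurge := intstr.FromInt(1)
	return appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxUnavailable: &maxUnavailable,
			MaxSurge:       &maxSurge,
		},
	}
}
```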