@davidepasquero

IPPool Agent Deployment With Rolling Updates

This enhancement replaces the per-IPPool agent Pod with a Deployment to
fix a crash recovery gap: when a node crashes, the single agent Pod can
remain stuck in "Terminating" on that node, preventing rescheduling and
causing a DHCP service disruption. A Deployment removes the single-name
constraint and allows a new agent Pod to be scheduled on a healthy node
while the old Pod is still terminating.

Summary

This enhancement migrates IPPool agent management from a standalone Pod to
an apps/v1 Deployment with a RollingUpdate strategy. The change is driven
by a critical reliability issue in the current design: after a node crash,
the agent Pod can stay in a stuck Terminating state on the failed node,
blocking rescheduling and requiring manual cleanup. By managing the agent
through a Deployment, the controller can create a replacement Pod on a
healthy node even if the old Pod is stuck, while also tracking the
Deployment in status.agentDeploymentRef and reporting readiness based on
ObservedGeneration, UpdatedReplicas, and AvailableReplicas.

Related Issues

harvester/harvester#8205

Motivation

The previous design managed each agent as a single, standalone Pod, which had
a critical flaw: after a node crash, the agent Pod could remain stuck in a
"Terminating" state on that node and could not be rescheduled onto a healthy
node without manual intervention, causing a service disruption.
A Deployment provides the ability to create a replacement Pod when the
current one is unavailable, and gives clearer readiness signals for the
controller.

Goals

  • Ensure the IPPool agent can be rescheduled automatically after a node
    crash without manual cleanup.
  • Use a Deployment per IPPool agent with RollingUpdate
    (maxUnavailable=0, maxSurge=1).
  • Track the backing Deployment in IPPool status, including image and UID.
  • Gate AgentReady on Deployment readiness (ObservedGeneration, updated and
    available replicas).
  • Keep agent configuration via existing Helm values and controller flags.
  • Update RBAC and generated clients to manage Deployment resources.

Non-goals [optional]

  • Changing IP allocation, lease, or DHCP behavior.
  • Introducing a new user-facing API beyond the status reference.
  • Reworking agent networking or multus attachment behavior.

Proposal

Create an apps/v1 Deployment for each IPPool agent and reconcile it in
place. The controller updates Deployment template fields when the desired
agent configuration changes and reports readiness through IPPool
conditions. If the Deployment reference is missing or stale, the
controller recreates it and updates status.agentDeploymentRef. The
Deployment allows a new Pod to be scheduled even when the old Pod is
stuck terminating after a node crash.
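
To make this flow concrete, here is a minimal Go sketch of the
create-or-recreate-or-update decision. The function name ensureAgentDeployment
and the small client/cache interfaces are illustrative stand-ins for the
generated wrangler clients; the real logic lives in
pkg/controller/ippool/controller.go.

```go
// Sketch only: illustrates the decision described in the Proposal above.
package ippool

import (
	"reflect"

	appsv1 "k8s.io/api/apps/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
)

// Minimal stand-ins for the generated deployment client and cache.
type deploymentClient interface {
	Create(*appsv1.Deployment) (*appsv1.Deployment, error)
	Update(*appsv1.Deployment) (*appsv1.Deployment, error)
}

type deploymentCache interface {
	Get(namespace, name string) (*appsv1.Deployment, error)
}

// ensureAgentDeployment reconciles the agent Deployment for one IPPool.
// refNamespace/refName come from status.agentDeploymentRef (empty if unset);
// the caller records namespace, name, UID, and image in status afterwards.
func ensureAgentDeployment(client deploymentClient, cache deploymentCache,
	refNamespace, refName string, desired *appsv1.Deployment) (*appsv1.Deployment, error) {

	if refName == "" {
		// No reference yet: create the Deployment.
		return client.Create(desired)
	}

	current, err := cache.Get(refNamespace, refName)
	if apierrors.IsNotFound(err) {
		// Stale reference: the Deployment is gone, recreate it.
		return client.Create(desired)
	} else if err != nil {
		return nil, err
	}

	// Update mutable fields in place when the template drifts from desired
	// (comparison simplified here); the selector is immutable and never
	// rewritten.
	if !reflect.DeepEqual(current.Spec.Template, desired.Spec.Template) {
		toUpdate := current.DeepCopy()
		toUpdate.Spec.Template = desired.Spec.Template
		return client.Update(toUpdate)
	}
	return current, nil
}
```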

User Stories

Story 1

Before: A node crash leaves the single agent Pod stuck in Terminating, and
the agent cannot be rescheduled without manual intervention, disrupting
DHCP service.
After: The Deployment creates a replacement Pod on a
healthy node, restoring service automatically.

API changes

  • Add status.agentDeploymentRef (namespace, name, image, UID) to the
    IPPool CRD.
  • Remove any dependency on Pod references in status.
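
A minimal sketch of the new status field, assuming the field names follow the
description above (namespace, name, image, UID); the authoritative definition
is in pkg/apis/network.harvesterhci.io/v1alpha1/ippool.go and may differ in
JSON tags or optionality.

```go
package v1alpha1

import "k8s.io/apimachinery/pkg/types"

// DeploymentReference points at the agent Deployment backing an IPPool.
type DeploymentReference struct {
	Namespace string    `json:"namespace,omitempty"`
	Name      string    `json:"name,omitempty"`
	Image     string    `json:"image,omitempty"`
	UID       types.UID `json:"uid,omitempty"`
}

// IPPoolStatus excerpt: agentPodRef is replaced by agentDeploymentRef.
type IPPoolStatus struct {
	AgentDeploymentRef *DeploymentReference `json:"agentDeploymentRef,omitempty"`
	// ...other status fields are unchanged...
}
```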

Design

Implementation Overview

  • Add a Deployment builder for IPPool agents with RollingUpdate strategy.
  • Reconcile the Deployment in the IPPool controller, updating template
    fields and tracking UID/image in status.agentDeploymentRef.
  • Gate AgentReady based on Deployment readiness.
  • Update RBAC to manage Deployments.
  • Generate apps/v1 controllers and add a fake deployment client for tests.
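
The readiness gate above boils down to a small predicate over the Deployment
status. The sketch below assumes a single-replica default and an illustrative
function name (isAgentDeploymentReady).

```go
package ippool

import appsv1 "k8s.io/api/apps/v1"

// isAgentDeploymentReady reports whether the agent Deployment has converged:
// the Deployment controller has observed the latest generation and every
// desired replica is both updated and available. With replicas=1 this means
// the single agent Pod runs the desired template and passes its readiness
// probe.
func isAgentDeploymentReady(d *appsv1.Deployment) bool {
	if d.Status.ObservedGeneration < d.Generation {
		return false
	}
	replicas := int32(1)
	if d.Spec.Replicas != nil {
		replicas = *d.Spec.Replicas
	}
	return d.Status.UpdatedReplicas == replicas &&
		d.Status.AvailableReplicas == replicas
}
```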

Test plan

  • Unit tests for IPPool controller deployment reconciliation.
  • Manual: create an IPPool, verify the agent Deployment and status
    reference, simulate a node failure, confirm the replacement Pod is
    scheduled and conditions recover.

Upgrade strategy

  • On upgrade, the controller creates a Deployment for each IPPool and
    updates status with agentDeploymentRef.
  • Any legacy agent Pods should be removed to avoid running duplicate
    agents; the Deployment becomes the single managed resource going
    forward, and ensures replacement Pods can be scheduled on node failure.

Note [optional]

  • RollingUpdate uses maxUnavailable=0 and maxSurge=1 to keep the DHCP
    agent available during image changes and node recovery.
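
Expressed with the Kubernetes API types, the strategy looks roughly like the
sketch below; the variable names are illustrative, and the real values are set
by the agent Deployment builder.

```go
package ippool

import (
	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

var (
	maxUnavailable = intstr.FromInt(0) // never take the only replica down first
	maxSurge       = intstr.FromInt(1) // allow one extra Pod during a rollout

	// agentDeploymentStrategy keeps the DHCP agent serving while a new Pod is
	// rolled out or brought up during node recovery.
	agentDeploymentStrategy = appsv1.DeploymentStrategy{
		Type: appsv1.RollingUpdateDeploymentStrategyType,
		RollingUpdate: &appsv1.RollingUpdateDeployment{
			MaxUnavailable: &maxUnavailable,
			MaxSurge:       &maxSurge,
		},
	}
)
```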

@davidepasquero (Author)

Leader election between agent Pods has not yet been implemented, so the replica count must remain at 1. Even so, this already solves our problem of an agent stuck on a node that crashes.

@davidepasquero (Author) commented Dec 22, 2025

@starbops @votdev @yasker

Related Issues
harvester/harvester#8205

IPPool agent: detailed change description

This document explains the changes introduced in the latest commit that
moves IPPool agents from single Pods to Deployments with RollingUpdate,
including API, controller, RBAC, codegen, and tests.

Context and goal

  • The previous model used one Pod per IPPool; after a node crash the Pod could
    remain stuck in Terminating and would not be rescheduled automatically.
  • The new model uses a Deployment to guarantee self-recovery and controlled
    upgrades while keeping the agent behavior intact.

Main changes (overview)

  • Replaced status.agentPodRef with status.agentDeploymentRef in the CRD.
  • Added reconciliation logic for Deployments (create, update, monitor, cleanup).
  • Set RollingUpdate with maxUnavailable=0, maxSurge=1.
  • Updated RBAC, codegen, generated controllers, fake client, and tests.

API and CRD

  • pkg/apis/network.harvesterhci.io/v1alpha1/ippool.go

    • Removed AgentPodRef.
    • Added AgentDeploymentRef with DeploymentReference (namespace, name,
      image, uid).
  • chart/crds/network.harvesterhci.io_ippools.yaml

    • CRD schema now exposes status.agentDeploymentRef.
  • pkg/data/data.go

    • Bindata regenerated to include the updated CRD (byte array, size/modTime).

IPPool controller: flow and logic

  • pkg/controller/ippool/controller.go

    • The Handler now uses deploymentClient and deploymentCache (apps/v1).

    • Register watches Deployments labeled vm-dhcp-controller=agent and maps
      them back to their IPPools via the ippool-namespace/ippool-name labels.

    • DeployAgent:

      • Computes desiredImage with getAgentImage (honors hold annotation).

      • If status already references a Deployment:

        • Verifies deletion timestamp, UID, and selector (immutable).
        • Compares and updates labels, strategy, replicas, template, affinity,
          init/containers, and annotations, calling Update only when needed.
      • If missing or not found, creates a new Deployment.

      • Updates status.agentDeploymentRef with namespace, name, uid, image.

    • MonitorAgent:

      • Validates existence, UID, image in the template, and readiness.
      • Readiness uses ObservedGeneration, UpdatedReplicas, and
        AvailableReplicas.
      • On mismatch, returns an error (does not delete the resource).
    • cleanup:

      • Deletes the associated Deployment instead of a Pod.
    • getAgentImage:

      • If hold-ippool-agent-upgrade is set and AgentDeploymentRef.Image
        is present, uses it; otherwise uses the default image.
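
A sketch of the image-selection rule behind getAgentImage. The annotation key
spelled out below is an assumption for illustration, and the helper signature
is simplified; the actual constant and function live in the controller package.

```go
package ippool

// holdAgentUpgradeAnnotation is an assumed spelling of the hold annotation;
// the real key is defined as a constant in the repository.
const holdAgentUpgradeAnnotation = "network.harvesterhci.io/hold-ippool-agent-upgrade"

// getAgentImage returns the image the agent Deployment should run: if the
// IPPool carries the hold annotation and status.agentDeploymentRef already
// records an image, that image is kept; otherwise the default agent image
// (from Helm values / controller flags) is used.
func getAgentImage(annotations map[string]string, currentRefImage, defaultImage string) string {
	if _, held := annotations[holdAgentUpgradeAnnotation]; held && currentRefImage != "" {
		return currentRefImage
	}
	return defaultImage
}
```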

Deployment spec: construction details

  • pkg/controller/ippool/common.go

    • New prepareAgentDeployment:

      • Labels: vm-dhcp-controller=agent, ippool-namespace, ippool-name.

      • Selector matches the same labels (immutable).

      • RollingUpdate with maxUnavailable=0, maxSurge=1.

      • PodTemplate with Multus annotation k8s.v1.cni.cncf.io/networks.

      • NodeAffinity for the cluster network (same as before).

      • Init container ip-setter and container agent with:

        • ImagePullPolicy=IfNotPresent.
        • Explicit TerminationMessagePath/Policy to avoid drift.
        • HTTP probes on /healthz and /readyz (port 8080).
    • Updated status builders and new helpers for Deployment.

    • deploymentBuilder provides DeploymentReady(true) to simulate
      ObservedGeneration, Updated/Available/ReadyReplicas in tests.
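
For illustration, the probe and termination-message settings called out above
look roughly like the sketch below using core/v1 types. The helper name and
the exact pinned defaults are assumptions; the paths and port follow the
description (/healthz, /readyz, 8080).

```go
package ippool

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// agentProbes builds the HTTP liveness and readiness probes described above.
func agentProbes() (liveness, readiness *corev1.Probe) {
	liveness = &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/healthz", Port: intstr.FromInt(8080)},
		},
	}
	readiness = &corev1.Probe{
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{Path: "/readyz", Port: intstr.FromInt(8080)},
		},
	}
	return liveness, readiness
}

// Pinning pull policy and termination-message fields explicitly means a
// comparison against the live object does not see spurious drift caused by
// API-server defaulting (assumed here to pin the Kubernetes defaults).
var agentContainerDefaults = corev1.Container{
	ImagePullPolicy:          corev1.PullIfNotPresent,
	TerminationMessagePath:   corev1.TerminationMessagePathDefault,
	TerminationMessagePolicy: corev1.TerminationMessageReadFile,
}
```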

RBAC

  • chart/templates/rbac.yaml

    • Permissions moved from pods to deployments.
    • ClusterRole now includes get/list/watch for deployments.
    • Namespace Role now includes get/create/update/delete for deployments.
    • Role and RoleBinding renamed from *-pod-manager/manage-pods to
      *-deployment-manager/manage-deployments.
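
The same permission split, shown as rbac/v1 rules in Go purely for
illustration; the authoritative source is the Helm template in
chart/templates/rbac.yaml.

```go
package ippool

import rbacv1 "k8s.io/api/rbac/v1"

// Cluster-wide read access to deployments for the controller's watches/caches.
var clusterDeploymentRule = rbacv1.PolicyRule{
	APIGroups: []string{"apps"},
	Resources: []string{"deployments"},
	Verbs:     []string{"get", "list", "watch"},
}

// Namespaced write access so the controller can manage the agent Deployments.
var namespacedDeploymentRule = rbacv1.PolicyRule{
	APIGroups: []string{"apps"},
	Resources: []string{"deployments"},
	Verbs:     []string{"get", "create", "update", "delete"},
}
```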

Codegen and generated controllers

  • pkg/codegen/main.go

    • Added apps/v1 with Deployment to controller codegen.
  • pkg/config/context.go

    • Added AppsFactory to management setup.
  • pkg/generated/controllers/apps/*

    • Generated controller/cache for Deployment (factory, interface, v1/deployment).
  • pkg/util/fakeclient/deployment.go

    • New fake client/cache for Deployment used by tests.
  • pkg/generated/clientset/versioned/typed/kubevirt.io/v1/*

    • Regenerated client: removed UpdateStatus for VMI, matching generator output.

Tests

  • pkg/controller/ippool/controller_test.go

    • Tests migrated from Pod to Deployment.
    • Updated cases for create/upgrade/hold, uid mismatch, readiness, image mismatch.
    • Uses prepareAgentDeployment and DeploymentReady(true) builders.
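
What DeploymentReady(true) has to simulate can be sketched as a helper that
marks a test Deployment's status as fully converged; the function below is an
illustrative stand-in, not the actual builder method.

```go
package ippool

import appsv1 "k8s.io/api/apps/v1"

// markDeploymentReady mutates a test Deployment so that the readiness checks
// (ObservedGeneration, UpdatedReplicas, AvailableReplicas, ReadyReplicas)
// see a fully converged rollout.
func markDeploymentReady(d *appsv1.Deployment) *appsv1.Deployment {
	replicas := int32(1)
	if d.Spec.Replicas != nil {
		replicas = *d.Spec.Replicas
	}
	d.Status.ObservedGeneration = d.Generation
	d.Status.Replicas = replicas
	d.Status.UpdatedReplicas = replicas
	d.Status.AvailableReplicas = replicas
	d.Status.ReadyReplicas = replicas
	return d
}
```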

Compatibility and operational notes

  • agentPodRef is no longer read; legacy Pods are not managed by the controller.
  • RollingUpdate with maxSurge=1 can temporarily run a second agent during upgrade.
  • Deployment selector is treated as immutable; mismatch returns an error to avoid
    adopting the wrong workload.

Main touched files

  • pkg/apis/network.harvesterhci.io/v1alpha1/ippool.go
  • pkg/controller/ippool/common.go
  • pkg/controller/ippool/controller.go
  • pkg/controller/ippool/controller_test.go
  • pkg/util/fakeclient/deployment.go
  • pkg/codegen/main.go
  • pkg/config/context.go
  • chart/templates/rbac.yaml
  • chart/crds/network.harvesterhci.io_ippools.yaml
  • pkg/data/data.go
