
Node stays SchedulingDisabled after failed deployment due to missing uncordon... #4874

@eve-scality

Description

Identified by automated analysis of ARTESCA-16355
Confidence: high

What needs to change

File: salt/metalk8s/orchestrate/deploy_node.sls
Version: 131.0.9
Add an onfail handler to deploy_node.sls that uncordons the node when any step in the orchestration fails, following the same pattern already used in upgraded.sls:

# Add after the 'Uncordon the node' state in deploy_node.sls
Uncordon the node on failure:
  metalk8s_cordon.node_uncordoned:
    - name: {{ node_name }}
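    # onfail: this state runs only when 'Run the highstate' fails; on success it is
    # skipped and the existing 'Uncordon the node' state (which requires the
    # highstate) performs the uncordon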
    - onfail:
      - salt: Run the highstate

Additionally, consider adding a pre-flight check in the ARTESCA installer's deploy_platform function (or in deployed.sls) to uncordon all nodes at the start of a redeploy. This provides defense-in-depth in case the onfail handler doesn't fire correctly:

# Add at the beginning of deployed.sls, before node deployments
Ensure all nodes are uncordoned before deployment:
  module.run:
    - artesca_kubernetes.run_kubectl_cmd:
      - cmd: "uncordon --selector=metalk8s.scality.com/version"
      - output: ""
    - require:
      - artesca: Ensure pods in kube-system namespace on {{ nodes.bootstrap }} are ready
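
If the custom artesca_kubernetes.run_kubectl_cmd execution module is not usable in this context, a plain cmd.run state gives the same effect. The following is a minimal sketch only: the state ID, the assumption that kubectl is on the master's PATH, and the /etc/kubernetes/admin.conf kubeconfig path are all unverified assumptions.

# Alternative sketch using a generic cmd.run state (kubeconfig path is an assumption)
Ensure all nodes are uncordoned before deployment (cmd.run variant):
  cmd.run:
    - name: kubectl uncordon --selector=metalk8s.scality.com/version
    - env:
      - KUBECONFIG: /etc/kubernetes/admin.conf
    - require:
      - artesca: Ensure pods in kube-system namespace on {{ nodes.bootstrap }} are ready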

The primary fix should be in MetalK8s (deploy_node.sls) since that's where the cordon/uncordon lifecycle is managed. The ARTESCA-level fix is a secondary safety net.

Root Cause

The root cause is in MetalK8s's deploy_node.sls Salt orchestration. During node deployment, the orchestration cordons the node (sets unschedulable=True) at line 140 before running the highstate. The uncordon step at line 313 has a hard require dependency on the highstate succeeding. If the highstate or any preceding step fails, the Salt orchestration aborts and the uncordon state is never executed, leaving the node permanently in SchedulingDisabled status.
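
For illustration, here is a heavily simplified sketch of the dependency chain described above. The state IDs are taken from the evidence section below; the module names, arguments, and requisites are assumptions used only to show the shape of the problem, not the actual contents of deploy_node.sls:

# Simplified sketch, not the real file: 'metalk8s_cordon.node_cordoned' and the
# exact arguments are assumptions
Cordon the node:
  metalk8s_cordon.node_cordoned:
    - name: {{ node_name }}

Run the highstate:
  salt.state:
    - tgt: {{ node_name }}
    - highstate: True
    - require:
      - metalk8s_cordon: Cordon the node

Uncordon the node:
  metalk8s_cordon.node_uncordoned:
    - name: {{ node_name }}
    - require:
      - salt: Run the highstate  # hard require: skipped whenever the highstate fails

Because a require requisite only lets the dependent state run when the required state succeeds, a highstate failure leaves 'Uncordon the node' skipped and the node cordoned.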

In the reported scenario: (1) First deploy attempt runs deploy_node for all 3 nodes — each node is cordoned → highstate → uncordoned. (2) The deploy base step then fails at "Create Keycloak 'StorageManager' main role". (3) On redeploy, deploy platform re-runs deploy_node for each node. Each node is cordoned again (idempotent). If node-3's highstate encounters any issue during the re-deployment (possibly caused by partially-applied state from the first attempt), the orchestration fails and the uncordon step is skipped — leaving node-3 stuck in SchedulingDisabled.

This contrasts with the MetalK8s upgrade orchestration (upgraded.sls) which properly includes an onfail handler (lines 40-47) that uncordons all nodes when the upgrade fails. The deploy_node.sls orchestration has no equivalent error-handling mechanism.
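
For comparison, a sketch of what that pattern looks like in spirit; the actual lines 40-47 of upgraded.sls are not reproduced here, and the loop variable, state IDs, and onfail target below are illustrative assumptions only:

# Illustrative only: loop variable and state IDs are assumptions, not the real upgraded.sls
{%- for node in nodes_to_upgrade %}
Uncordon {{ node }} on upgrade failure:
  metalk8s_cordon.node_uncordoned:
    - name: {{ node }}
    - onfail:
      - salt: Upgrade node {{ node }}
{%- endfor %}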

Evidence

salt/metalk8s/orchestrate/deploy_node.sls
├── L140: Cordon the node: ...  — The node is cordoned (SchedulingDisabled) early in the deploy_node orchestration, before the highstate runs. This sets the node's spec.unschedulable=True in the Kubernetes API.
├── L285: Run the highstate: ...  — The highstate runs AFTER the cordon step and is a prerequisite for the uncordon. If this step fails for any reason, the uncordon will never execute.
└── L313: Uncordon the node: ...  — The uncordon step has a hard 'require' on 'Run the highstate'. In Salt, if a required state fails, dependent states are skipped entirely. There is NO onfail handler to uncordon on error — the node stays cordoned forever if the highstate fails.
salt/metalk8s/orchestrate/upgraded.sls
└── L40: # Automatically uncordon nodes on upgrade failure ...  — The UPGRADE orchestration correctly handles failures by using an 'onfail' handler to uncordon all nodes when the upgrade fails. This pattern is missing from deploy_node.sls, confirming it's a known concern that was addressed for upgrades but not for initial deploys.
installer/artesca_installer/salt/artesca/metalk8s/deployed.sls
└── L247: Deploy node {{ node }}: ...  — The ARTESCA installer calls MetalK8s deploy_node orchestration for each remote node. If this orchestration fails, the ARTESCA installer has no cleanup logic to uncordon the node either — there's no error recovery at either the MetalK8s or ARTESCA level.

Upstream Impact

High severity. When a 3-node ARTESCA deployment fails for any reason (including transient Keycloak errors), a node can be left permanently in SchedulingDisabled state. This prevents the user from successfully redeploying ARTESCA because: (1) pods cannot be scheduled on the affected node, (2) the same deploy_node sequence re-cordons the already-cordoned node without fixing the underlying issue. The user has no way to recover without manual intervention (running kubectl uncordon). This affects all multi-node ARTESCA deployments (the 3-node HA configuration is the primary deployment model).
